Installation



(back to main documentation)

1.1. Download and unzip the complete set of installation and CLI scripts

We provide a simple tool to download all the required scripts directly to your local computer, HPC node, or cloud VM.

Please run the following command in a bash terminal window:

bash <(curl -s https://app.biomodal.com/cli/download)

This script requires you to log in with your existing biomodal username and password.
If you do not have a biomodal account, please contact us at support@biomodal.com

To reset the password for your account, please use this link https://app.biomodal.com/auth/pages/reset-password

The installation script requires the following software modules:

| Dependency | Details |
| --- | --- |
| bash | Version 4 or higher |
| jq | Version > 1.6; used for JSON parsing |
| curl | Used for downloading software from the biomodal website |
| Google Cloud CLI | Should be a recent version that supports the gcloud storage command, including the required gcloud-crc32c libraries |
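Before running the download command, you can quickly check that these tools are available; a minimal sketch, assuming all four are already on your PATH (version output formats vary by platform):

# Print the versions of the CLI download dependencies
bash --version | head -n 1
jq --version
curl --version | head -n 1
gcloud --version | head -n 1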

For the complete list of software dependencies, please refer to the section Installing pipeline and CLI software dependencies for details regarding dependencies on your platform.

On completing the curl command above, you will have downloaded the complete set of scripts as a zip file.

You have to unzip this file before you can start using the CLI scripts. The example below uses unzip, but you can use other tools, such as jar xvf, to uncompress the file:

unzip biomodal.1.1.3.zip

You can now change directory to the unzipped biomodal folder and continue with the rest of the steps outlined in this guide.
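For example, assuming the archive unpacks into a folder named biomodal (the exact folder name may differ between releases):

unzip biomodal.1.1.3.zip   # or: jar xvf biomodal.1.1.3.zip
cd biomodal                # assumed name of the unzipped folder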

1.1.1. Authentication tokens

Please note that the authentication process will create a new file holding the tokens required to authorise access to biomodal resources. The default location of this file is the user's $HOME directory, and it is called .biomodal-auth.json.

Each user running the CLI will have their own unique ${HOME}/.biomodal-auth.json file.
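Since this file holds access tokens, you may want to check that it is readable only by your own account; a minimal sketch (an optional precaution, not a step required by the CLI):

ls -l "${HOME}/.biomodal-auth.json"      # inspect current permissions
chmod 600 "${HOME}/.biomodal-auth.json"  # restrict the file to the owner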

1.1.2. Shared log events

Downloading and running the biomodal CLI script requires you to log in at least once with your existing biomodal username and password. The CLI can optionally share runtime event notifications with the biomodal API service for the purpose of providing customer support.

The following event codes are currently reported to the biomodal API service:

| Event code | Description | Optional? |
| --- | --- | --- |
| 100 | Downloaded the biomodal CLI package via the biomodal API | No |
| 101 | Completed running the biomodal init command | Yes |
| 102 | Completed running the biomodal test command | Yes |
| 103 | Completed running the biomodal analysis command | Yes |
| 104 | Completed running the biomodal dmr_call command | Yes |
| 105 | Completed running the biomodal reference make command | Yes |
| 200 | Customer prefers to automatically share events and metrics reports; recorded once during the biomodal init step | Yes |
| 201 | Customer prefers not to share events and metrics reports automatically; recorded once during the biomodal init step | Yes |
| 202 | User from the biomodal CRM system has registered for a software download and authentication account, via the biomodal API | No |

Please note that the event codes are not shared with any third party and are only used for internal support purposes. The CLI version and the event code are shared via the biomodal API and registered against the user’s account in our internal CRM system.

If you do not wish to share the optional events, please decline event and metrics report sharing during the biomodal init stage, or manually set share_metrics: false in the biomodal CLI configuration file, config.yaml. Please see section “1.4.2.2. CLI config file” below for more information.

If you opt out of sharing events and metrics reports, the CLI will not share any events or metrics reports. However, you will still have to authenticate with the biomodal API service to download biomodal software, test data and reference data.

Nextflow will automatically check for new software versions. To disable this automatic version check, additionally set the NXF_DISABLE_CHECK_LATEST environment variable to true in the shell environment where you plan to run the duet pipeline.

export NXF_DISABLE_CHECK_LATEST=true
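To make this setting persistent across shell sessions, you can append it to your shell profile; a minimal sketch, assuming a bash shell that sources ~/.bashrc:

echo 'export NXF_DISABLE_CHECK_LATEST=true' >> "${HOME}/.bashrc"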

If you do not have a biomodal account or would like to reset your password, please contact us at support@biomodal.com

1.1.3. Required permissions for cloud tenancies

We recommend taking a least-privilege approach when granting users permissions to create cloud resources. The cloud-specific examples below demonstrate the minimum permissions required to bootstrap resources for AWS and GCP environments.

AWS

Due to the number of resources created, we recommend that someone with the AdministratorAccess policy attached to their user account runs ./biomodal-cloud-utils create aws.

biomodal publishes a specific AWS Terraform module in a GitHub repository located here: https://github.com/cegx-ds/terraform-aws-bootstrap.git. Organisations may not allow public GitHub repositories to be accessed without “whitelisting” these first. If you encounter an error similar to the one below, please contact your IT department, or allow access to this repository by running the command referenced below. Make sure you replace the /biomodal/ reference with the current installation location you are using for the biomodal CLI.

Could not download module "bootstrap" source code from "git::https://github.com/cegx-ds/terraform-aws-bootstrap.git ": error
downloading 'https://github.com/cegx-ds/terraform-aws-bootstrap.git':  /usr/bin/git exited with 128: fatal: detected dubious ownership in repository at '/biomodal/terraform/aws/.terraform/modules/bootstrap'

To add an exception for this directory, call:

git config --global --add safe.directory /biomodal/terraform/aws/.terraform/modules/bootstrap

GCP

We recommend using GCP’s predefined roles. These roles are created and maintained by Google, so you do not need to create custom roles.
Below are the predefined roles required to create the cloud resources for running this CLI.

| Role Name | Purpose | Required |
| --- | --- | --- |
| roles/storage.admin | If a new bucket is required | No (unless a new bucket is specified) |
| roles/artifactregistry.writer | Create the required artifact registry | Yes |
| roles/iam.serviceAccountCreator | Create the required service account(s) | Yes |
| roles/compute.admin | Create the required compute resources and instance template | Yes |
| roles/iap.admin | Allow IAP access to VMs | Yes |
| roles/resourcemanager.projectIamAdmin | Assign project-wide roles to service accounts | Yes |
| roles/storage.legacyBucketWriter | Allow read/write permission on an existing bucket | No (required if using an existing bucket) |

Conditions and custom roles can also be created for more granular access restriction, however this is beyond the scope of this document.
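As an illustration, a predefined role can be granted to a user with gcloud; the project ID and user email below are hypothetical, and the command would be repeated for each role in the table above:

gcloud projects add-iam-policy-binding my-duet-project \
  --member="user:analyst@example.com" \
  --role="roles/compute.admin"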

1.1.4. Software dependencies

Please refer to the section Installing pipeline and CLI software dependencies for details regarding software dependencies on your platform.

(back to main documentation) | (back to top)

1.2. Choosing the correct installation script

Installation scripts

(back to main documentation) | (back to top)

1.3. Setting up the pipeline on cloud VMs – Using the biomodal-cloud-utils script

We have created a bootstrap tool to assist with the creation of your cloud resources required to run the duet pipeline with Nextflow. The 3 currently supported clouds are gcp, aws and azure.

The biomodal-cloud-utils utility is intended for DevOps / system administrators who create and maintain cloud tenancies. It utilises Terraform as the infrastructure-as-code tool that creates all the compute resources needed to run biomodal pipelines with Nextflow. Please see the respective documentation for your cloud provider for more information, referenced later in this guide.

Please follow these steps from a Linux-compatible computer:

  1. Install Terraform – see this link for OS-specific instructions.

  2. Make sure your cloud admin has given your account enough privileges to create cloud resources over API (see required permissions).

  3. Authenticate Terraform by your cloud provider:

    3.1. Authenticate Terraform – GCP

    3.2. Authenticate Terraform – AWS

    3.3. Install Azure CLI and sign in to your Azure account

  4. Execute the following command to create a new cloud environment from the provider of choice:

    ./biomodal-cloud-utils create <cloud> # gcp / aws / azure

    Note: If the resource creation process is interrupted, please use the destroy command before attempting to use create again.

  5. Execute the following command to connect to the resources created in the above steps:

    ./biomodal-cloud-utils connect <cloud> # gcp / aws / azure

    Note: Please wait 4-5 more minutes after the create command has completed to allow the cloud resources to fully initialise.

    You can now continue with section 1.5. Import the duet multiomics solution pipeline using the biomodal CLI and perform a test run.

  6. To destroy the resources you created, execute:

    ./biomodal-cloud-utils destroy <cloud> # gcp / aws / azure

    WARNING: Destroying your resources is a permanent action; only run this command when you are sure you no longer need them. Non-empty buckets should not be deleted.

The following command is designed for testing your Terraform config and can be used for CI/CD actions:

  ./biomodal-cloud-utils test <cloud> # gcp / aws

You can also pass in the --remote argument to automatically generate a remote backend.

  ./biomodal-cloud-utils test <cloud> --remote

You can provide variables to the test command via Terraform environment variables (e.g. TF_VAR_foo). You can also pass remote backend configuration via environment variables.
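For example, a sketch of a test invocation that sets a Terraform variable via the environment (TF_VAR_region is an assumed variable name; use the variable names defined in your Terraform configuration):

export TF_VAR_region="europe-west2"
./biomodal-cloud-utils test gcp --remote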

Please note that the cloud VM will be created with a public IP to allow for data ingress/egress. Please update this as required after software install, authentication, configuration and testing.

(back to main documentation) | (back to top)

1.4. Setting up the pipeline on HPC clusters

Please do not attempt installing the CLI or duet pipeline without extensive knowledge of the local Linux HPC environment. We recommend that you contact your local HPC admin or IT support before attempting to install the CLI or duet pipeline. Restrictive quotas on local disk resources and software install limitations are likely to cause issues during installation and update operations.

1.4.1. Using the biomodal bootstrap on HPC tool

We have created a bootstrap tool to assist with installing required software and downloading the pipeline on an HPC cluster or standalone server. Currently, Ubuntu and CentOS/RHEL systems are supported. Other Linux distributions may work fine provided they use the apt-get or dnf package managers, but we have not tested them.

The biomodal-hpc-utils-admin script uses sudo access to install the required OS packages and software locally to run a Nextflow pipeline or workflow. The suggested HPC setup uses Singularity as an alternative to Docker. Please note that Singularity does not enable or propagate root privileges to the Singularity images.

NB!! Before you execute the biomodal-hpc-utils-admin script, make sure your account has sudo privileges. The biomodal-hpc-utils-admin script requires sudo access and must be reviewed by an HPC admin before executing, as every HPC installation will require different settings. This script will create all the local folders and settings files for the pipeline, install the required OS and system software, and download the duet pipeline software.

There is an alternative script to install software without using root or sudo access. This script is called biomodal-hpc-utils-conda. It should not be used by a non-technical user, as extensive knowledge of the local Linux HPC environment is required.


Please review each step in the biomodal-hpc-utils-admin and biomodal-hpc-utils-conda script(s) before you run them on any HPC cluster or local machine. Do not assume that all steps in these scripts will be an exact match for your local server or HPC nodes. If you choose to use the biomodal-hpc-utils-conda script, we recommend you use an existing working conda environment and have the required permissions to install software in that environment.


Executing the biomodal-hpc-utils-admin or biomodal-hpc-utils-conda scripts requires in-depth knowledge of the local Linux HPC cluster environment and should not be attempted by a non-technical user.

The scripts will ask you to confirm that you have read the documentation and understand the implications of running them on your system.

Please confirm that you have read the documentation and understand the implications of running this script on your system.
Are you sure you want to run this script? (y/n)

The rest of section 1.4 in this guide will only refer to the biomodal-hpc-utils-admin script. We recommend using a terminal multiplexer such as tmux to manage persistent shell sessions and ensure long-running sessions are not terminated.
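For example, a typical tmux workflow (the session name is arbitrary):

tmux new -s biomodal       # start a named session and run the install/pipeline commands in it
# detach with Ctrl-b d; re-attach later with:
tmux attach -t biomodal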

To initiate installation, execute the following command:

./biomodal-hpc-utils-admin

The script will configure software and create two local files that the main biomodal tools use to locate local data directories and other resources. Please respond to the following prompts and select accordingly:

Enter cli installation path. : /biomodal

You can leave this as default or select an installation path of your choosing.

Enter data bucket path - shared location where ref data and images will be stored. : /biomodal/data_bucket

You can leave this as the default or select a data storage path to use as the default storage directory for reference data, input, and output.

Choose relevant geographical location of the biomodal docker registry (1-3)?
1) asia
2) europe
3) us
#?

Select the region from which biomodal GCP resources will be pulled.

Current queueSize is: 200
This is the maximum number of concurrent jobs that the pipeline will attempt to launch
Would you like to change it? (y/n)

You can limit the maximum number of concurrent Nextflow jobs that the duet pipeline will attempt to launch. This is useful for HPC clusters or cloud projects with limited resources.

Which executor would you like to use? (choose 1-4)
1) slurm
2) lsf
3) sge
4) local
#?

Please select the Nextflow executor of choice; if you are not using an HPC cluster scheduler, please select local.
The local executor setting will still require a server with significant memory, CPU and disk space to run the pipeline.

We recommend using a terminal multiplexer such as tmux to manage persistent shell sessions, due to the potentially long-running nature of pipelines.

When the script runs to completion, please log out and back in again to activate any new environment variables.

If you want to edit the config and/or Nextflow files manually, please see details below.

(back to main documentation) | (back to top)

1.4.2. Installing duet pipeline and software manually

1.4.2.1. Installing pipeline and CLI software dependencies

If you are using an unsupported OS, do not have sudo access or are unable to use the biomodal-hpc-utils-admin tool, please install and/or check if the following dependencies are installed in your environment:

| Dependency | Details |
| --- | --- |
| bash | Version 4 or higher |
| Java | Version 8 or higher |
| Nextflow | A recent version is recommended; tested and supported from 21.06 onwards |
| Google Cloud CLI | Please ensure you have a recent version, as this is used for software installation and authentication. Only required on the nodes that install the pipeline and workflow software; it is not needed to run the duet pipeline |
| Singularity | Also known as Apptainer. Apptainer is only supported by more recent versions of Nextflow, so check whether your Nextflow version supports Singularity and/or Apptainer |
| tar, unzip | Used to uncompress the duet pipeline, reference data and CLI software from biomodal |
| rsync | Copies files internally on HPC platforms |
| pcre2grep | Used to search for patterns in text files |
| curl | Downloads software from biomodal.com |
| jq | Used for JSON parsing |
| Other OS support packages | gnupg, lsb-release, apt-transport-https, ca-certificates |

Some installations may additionally require Google Cloud SDK components to ensure the integrity of downloaded data; these can be installed using: gcloud components install gcloud-crc32c
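You can spot-check the main runtime dependencies before proceeding; a minimal sketch (version flags may differ slightly between distributions and between Singularity and Apptainer builds):

java -version
nextflow -version
singularity --version || apptainer --version
rsync --version | head -n 1
jq --version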

If you are missing any of the above, please contact your HPC or IT administration for installation/enablement of these packages.

You can refer to the biomodal-hpc-utils-admin script as a guide for installing these packages on different platforms.

1.4.2.2. CLI config file

The biomodal CLI script relies on a config file called config.yaml. The biomodal-hpc-utils-admin script will add this file to the default location, /biomodal/, or another user-provided location. This config file keeps track of the current version of the pipeline (duet_version) and the pipeline data location (directory or bucket_url).

If you are not using the biomodal-hpc-utils-admin, please create the file with the following content.

platform: hpc
duet_version: 1.4.1
ref_pipeline_version: 1.0.1
bucket_url: <default bucket/folder location of your pipeline data input / output>
work_dir: <folder location/path where pipeline temporary data is stored>
init_folder: <folder location/path where duet pipeline software is stored>
biomodal_registry: <biomodal_registry, see below>
biomodal_api_baseurl: https://app.biomodal.com
biomodal_releases_bucket: gs://cegx-releases
biomodal_reference_bucket: gs://release-reference-files
share_metrics: false
error_strategy: normal

init_folder should point to the location of the duet pipeline and D(h)MR workflow software folder, and work_dir should point to the temporary/scratch location for the duet pipeline. If work_dir is not set, the default <bucket_url>/nf-work/ location will be used when running the duet pipeline.

Please note that the location you choose for work_dir should have free space of about 6 to 7 times the size of the data in your --input-path location to avoid running out of disk space. Additionally, if work_dir is located on the same disk as the locations you choose for the --output-path and --input-path parameters, you should aim to have about 10 times the input size in free space before you start the analysis.
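A rough way to compare the input size against the available work_dir space is sketched below; the paths are examples and should be replaced with your actual --input-path and work_dir locations:

du -sh /data/run01/nf-input   # total size of the input data
df -h /data/run01/nf-work     # free space on the work_dir filesystem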

The biomodal_registry should be one of the following:

  • asia-docker.pkg.dev/cegx-releases/asia-prod
  • europe-docker.pkg.dev/cegx-releases/eu-prod
  • us-docker.pkg.dev/cegx-releases/us-prod

share_metrics should be true or false. If set to true, events and the duet pipeline metrics report will be shared with biomodal after a successful analysis run; if set to false, they will not be shared.

Please make sure that the biomodal script is in the same location as the config files.

1.4.2.3. Nextflow config file

Nextflow relies on the nextflow.config file in the biomodal script folder so that users can adjust the process section to match any local HPC scheduler and queue setup.

This local configuration file is applied after any settings defined in the standard duet pipeline configuration files. Nextflow will automatically cascade the settings in the duet pipeline configurations, followed by the nextflow.config in the biomodal script folder. This ensures that your customised settings are always retained in the nextflow.config file in the biomodal script folder when upgrading to any new releases of the duet pipeline.

If you are not using the biomodal-hpc-utils-admin script, please create the nextflow.config file with the following content.

singularity {
  enabled    = true
  autoMounts = true
  libraryDir = "<the location you used in the config.yaml file>"
}
params {
  registry = "<biomodal registry from choices below>"
}
process {
  executor = "slurm"
}
report {
  overwrite = true
}

Note: Please fill the “registry” value under “params” with one of the following:

"asia-docker.pkg.dev/cegx-releases/asia-prod"
"europe-docker.pkg.dev/cegx-releases/eu-prod"
"us-docker.pkg.dev/cegx-releases/us-prod"

Note: Please fill the “process” section according to your queue setup (select local if no HPC scheduler is present) to one of the following:

process {
  executor = "slurm"
}
process {
  executor = "lsf"
}
process {
  executor       = "sge"
  process.penv   = "smp"
  clusterOptions = "-S /bin/bash"
  //If your system does not support the h_rt/h_rss/mem_free settings, you can try to use the following settings
  //clusterOptions = { "-l h_vmem=${task.memory.toString().replaceAll(/[\sB]/,'')}" }
}
process {
  executor = "local"
}

These settings should be adjusted to match your HPC setup. For more advanced settings, please see section 7, HPC Recommendations.

(back to main documentation) | (back to top)

1.5. Import the duet multiomics solution pipeline using the biomodal CLI and perform a test run

After you have created a new cloud or HPC environment, you can start using the biomodal CLI tool.

To see all the commands and options you can use with the biomodal CLI tool, simply run biomodal with no parameters.

The biomodal CLI will:

  1. Import the latest version of the duet pipeline and D(h)MR workflow.
  2. Import required reference files, for both the duet pipeline and Reference Generation pipeline.
  3. Import the biomodal test data set, aka. ‘RunXYZ’, to make sure all the resources created manually or with Terraform work as expected.
  4. Synchronise all required biomodal Docker images from our host repository to your project repository.
  5. Enable you to update to a new version of the duet pipeline and reference data when available.
  6. Do a test run with biomodal test data set to make sure everything works as expected with Nextflow and configuration.

Note: When using a cloud environment, please make sure that you can launch Terraform and be authenticated with your cloud provider either from your local machine, a Cloud Shell, or existing VM in your cloud provider (see section 1.2.).

Note: The CLI scripts and required configuration files are available in the /biomodal folder. This is also added to your $PATH variable (when biomodal-cloud-utils or biomodal-hpc-utils-admin are executed). If you used an alternative location on HPC, please invoke the biomodal script using the full path to its location.

On your VM or HPC command line, run biomodal init

This will transfer the pipeline, reference files, test data set, and Docker images (Singularity on HPC) to the data location specified in the config.yaml file. If the data download process is interrupted, please remove data under the reference_files folder and re-run biomodal init.
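For example, a sketch of clearing partially downloaded reference data and re-running init, assuming the data location from config.yaml is /biomodal/data_bucket (adjust to your own bucket_url or directory setting):

rm -rf /biomodal/data_bucket/reference_files
biomodal init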

The duet pipeline will store temp files in a separate cache directory. This folder will typically contain a large number of files and can be cleaned out after successful analysis runs.

The duet pipeline temp directory is currently set as <current temp folder>.
Press Enter to continue using this directory or enter a new one here: <new temp folder>

You will be asked if you want to automatically upload the pipeline metrics report to biomodal after a successful analysis run. This enables biomodal to compare your pipeline setup with the biomodal internal pipeline for the purpose of quality control. This will share the metrics CSV report and the parameters you used to run the pipeline; no other data will be shared.

If you do not want to share the report automatically, you can manually run biomodal report <metrics report file name> <pipeline json parameters file name> to share a specific metrics report at any later time.
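For example (the file names below are hypothetical; use the metrics CSV and parameters JSON produced by your own analysis run):

biomodal report my_run.metrics.csv my_run.params.json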

Would you like to share the pipeline metrics report at the end of all successful analysis runs with biomodal? (1-2)
1) Yes
2) No
#?

You can specify which location you prefer the duet pipeline software to be installed to. This may be in your $HOME folder or on a shared software filesystem. Please note the directory must be created before running biomodal init.

The duet software location 'init_folder' property is not currently defined in the configuration file
Would you like to set it to /<home_folder>? (1-2)
1) Yes
2) No
#?

You will now be asked about the duet pipeline error strategy you would like to use:

  • Normal: retries failed jobs up to 10 times, depending on the exit status. This option will retry all failed jobs and continue with other samples should there be a problem with a specific sample.
  • FailFast: retries failed jobs only once or exits immediately, depending on the exit status and the executor.
Please select which duet pipeline error strategy you would like to use:
1) Normal.
2) FailFast.
#?

When biomodal init has completed, you can run biomodal test to test the pipeline using biomodal test data (Aka. “RunXYZ”).

The test run is expected to complete with output similar to the example below.

Completed at: 10-Mar-2023 12:51:33
Duration    : 45m 15s
CPU hours   : 10.0
Succeeded   : 27

(back to main documentation) | (back to top)

1.6. Testing the duet pipeline on a larger dataset

The biomodal test command will verify that all the software necessary to run the duet pipeline works as expected on a small test sample. However, it can be useful to run another test on a larger dataset, which is more representative of the scale of data generated by a biomodal assay. To perform this larger test, we will use the publicly available Genome In A Bottle (GIAB) data published by biomodal.

Firstly, go to the demo dataset instructions and follow the steps in the section entitled “Download Instructions”, using the --raw-fastq option to download fastq files just as they would look when generated by a sequencing instrument. You can download either the modc or evoc sample data (either will provide a good test of the duet pipeline); just make sure you select the appropriate option when running the biomodal analyse command.

Then, assuming you have downloaded the raw fastqs to a directory called $BIOMODAL_PATH/giab_data/nf-input, run the following command:

biomodal analyse \
  --input-path $BIOMODAL_PATH/giab_data \
  --meta-file CEGX_Run_meta.csv \
  --output-path $BIOMODAL_PATH/output \
  --additional-profile deep_seq \
  --tag giab_demo_data \
  --mode 5bp

Note that in this case the --mode 5bp parameter has been used; this needs to match the type of data you downloaded.

(back to main documentation) | (back to top)

1.7. nf-core community pipelines

The nf-core community has a number of publicly available Nextflow pipelines. The nf-core pipelines are designed to be easy to use and are well documented. You can find a list of the nf-core pipelines at https://nf-co.re/pipelines. Some nf-core pipelines may have a similar technical setup to your pipeline execution environment, so it is worth checking whether there are relevant Nextflow pipeline settings and parameters you can apply to your Nextflow environment.

(back to main documentation) | (back to top) | (Next)

Cambridge Epigenetix is now biomodal