Usage



(back to main documentation)

Now that you have confirmed that the biomodal CLI runs as expected with the test dataset, you are ready to use your own data. You can create more buckets or transfer your data to the bucket/location created above; we recommend making a folder for your data outside of the test data location used during setup.
To run the duet pipeline on your own sample FASTQ files, use the biomodal analyse command with parameters pointing to your own input and output locations (see section 2.5 for worked examples).

Please do not attempt to run biomodal test or biomodal analyse if you experienced any problems during the bootstrapping or biomodal init process.

biomodal init must always be successfully completed after a fresh install and after any manual update of the CLI and/or duet pipeline.

Please review the Elevated resource profiles section to ensure you allocate sufficient hardware resources relative to the number of reads per sample before running the duet pipeline.

2.1. Command Structure and Usage

Invoke the biomodal CLI without parameters to see all available commands and options:

biomodal v1.1.3
usage: biomodal [SUBCOMMAND] [OPTION]...

biomodal command line interface tools.

SUBCOMMANDS:
  auth                                   Login to biomodal
  init                                   Download current default version of the duet pipeline, docker images and reference data
  test                                   Run the duet analysis pipeline to validate technical setup only
  info                                   Display latest online versions and current local versions of duet software and reference data

  list                                   List duet pipeline versions available for download
  download <version>                     Download a given version of duet pipeline, with related Docker images and reference data

  analyse                                Analyse your samples using duet pipeline
    --input-path <input_path>              Required: custom bucket url to directory containing all fastq.gz files to be analysed
    --output-path <output_path>            Required: custom output disk location or bucket url
    --tag <tag>                            Required: custom tag to identify this analysis run
    --meta-file <meta_file_name>           Optional: name of a custom meta file residing in the 'input-path' directory, e.g. 'CEGX_Run_meta.csv'
    --additional-profile <*_seq>           Optional: additional configuration profile to use, e.g. 'deep_seq' or 'super_seq'
    --run-name <run_name>                  Optional: custom run name to identify this analysis run
    --targeted <true|false>                Optional: enable/disable the pipeline in targeted mode, default is 'false'
    --targeted-panel <panel name>          Optional: targeted panel 'twist_methylome' or 'twist_cancer', (NB! 'targeted' must be 'true')
    --compute-asm <true|false>             Optional: enable/disable Allele Specific Methylation (ASM), default is 'false'
    --chg-chh-contexts <true|false>        Optional: enable/disable CHG and CHH modification calling, default is 'false'
    --use-gvcf <true|false>                Optional: enable/disable joint variant calling and joint ASM calling, default is 'false'
    --mode <5bp|6bp>                       Optional: 5 base vs 6 base mode, default is '5bp'
    --reference-genome                     Optional: use an alternative reference genome
    --reference-genome-profile             Optional: full path to bespoke reference genome profile you have created using the 'reference make' command
    --quantification-output <quant_output> Optional: output format for quantification files, combination of 'bedmethyl,cxreport,bismark,bedgraph', default is 'cxreport'
    --additional-params <params>           Optional: additional comma separated parameters, e.g. 'param1=value1,param2=value2'
    --resume                               Optional: resume previous run using successfully completed tasks, default is 'false'
    --work-dir                             Optional: override the default path where the workflow temporary data is stored

  validate                               Validate your parameters before you run the duet pipeline (analyse command)
                                           Parameters: Same as for the 'analyse' command listed above

  call_dmr                               Run D(h)MR analysis
    --input-path <input_path>              Required: full custom bucket uri or directory path with Zarr stores to be analysed
    --dmr-sample-sheet <sample_sheet_name> Required: full path and name of the DMR sample sheet  
    --output-path <output_path>            Required: custom output disk location or bucket uri
    --tag <tag>                            Required: custom tag to identify this DMR run
    --mode <5bp|6bp>                       Required: 5 base vs 6 base mode, default is '5bp'
    --condition <condition(s)>             Required: a single DMR sample sheet column that contains the conditions between which to call DMRs
    --covariates <covariates>              Optional: one or more DMR sample sheet columns that contain covariates to account for during DMR calling
    --dmr-bed-path <dmr_bed_path>          Optional: a path to a bed file defining regions that DMR calling should be restricted to
    --evoc-modifications <mc|hmc|modc>     Optional: single or comma separated list of duet evoC modifications to call, 'mc' or 'hmc' in 6bp mode only
    --min-depth <min_depth>                Optional: contexts will only be removed if coverage <= min-depth in ALL SAMPLES, default is '0'
    --window-size <window_size>            Optional: window size for DMR analysis, default is '1000'
    --additional-params <params>           Optional: additional single or comma separated parameters, e.g. 'param1=value1,param2=value2'
    --run-name <run_name>                  Optional: custom run name to identify this analysis run
    --resume                               Optional: resume previous run using successfully completed tasks, default is 'false'
    --work-dir                             Optional: override the default path where the workflow temporary data is stored

  reference list                         List biomodal reference data versions available for download
  reference download <version>           Download a version of biomodal reference data

  reference pipeline list                List biomodal reference pipeline versions available for download
  reference pipeline download <version>  Download the biomodal reference pipeline to run locally
  reference make                         Make new alternative reference genome
    --input-path <ref_genome_path>         Required: full path for the reference genome gzipped FASTA file
    --output-path <ref_dir>                Required: custom output disk full path or bucket url
    --species <species>                    Required: reference species, e.g. 'Homo_sapiens', 'Mus_musculus'
    --reference-genome <ref_genome>        Required: Genome Reference Consortium official name, e.g. 'GRCh38Decoy', 'GRCm38p6'

  report <csv report> <params.json>      Send a specific duet metrics report to biomodal

(back to main documentation) | (back to top)

2.2. Input file requirements

File structure

The --input-path folder provided for the duet pipeline must contain the nf-input folder and the meta csv file, as per this example:

   my_input_path
    ├── biomodal_run_meta.csv
    └── nf-input
        ├── CEG93-01_S12_L001_R1_001.fastq.gz
        └── CEG93-01_S12_L001_R2_001.fastq.gz
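
For example, staging a new run for analysis might look like the following minimal sketch (the paths and filenames here are illustrative, not prescribed by the CLI):

  # Create the expected layout, then copy in the gzipped FASTQ files and the meta csv file
  mkdir -p /biomodal/data_bucket/my_input_path/nf-input
  cp /path/to/sequencing/output/*_001.fastq.gz /biomodal/data_bucket/my_input_path/nf-input/
  cp biomodal_run_meta.csv /biomodal/data_bucket/my_input_path/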

For the D(h)MR workflow, please note that the --input-path parameter should point to the full path of the relevant Zarr stores.

FASTQ files

Filename requirements

The pipeline requires gzipped lane-wise (not lane-merged) FASTQ files with filenames that satisfy the naming convention used in BaseSpace, i.e.

{sample-id}_{sample-number}_{lane}_{R1|R2}_001.fastq.gz

The pipeline makes use of underscores as delimiters separating fields in the filename, so it is important that underscores are not used inside any fields, such as the sample IDs (note that BaseSpace automatically converts underscores in sample IDs into dashes). Please also make sure your sample IDs do not start with a number.

Note that the pipeline disregards the {sample-number} field (which is usually allocated automatically by BaseSpace based on the order of samples in your sample sheet). However, the field must still be present in the filename, because the pipeline tokenises filenames using underscores as delimiters and takes the sample ID, lane and R1/R2 from the first, third and fourth fields respectively.

Here is an example of a set of FASTQ filenames that meets the naming convention requirements (in this case, two samples, each pooled across two lanes):

CEG900-123-678_S1_L001_R1_001.fastq.gz  CEG900-123-678_S1_L001_R2_001.fastq.gz
CEG900-123-678_S1_L002_R1_001.fastq.gz  CEG900-123-678_S1_L002_R2_001.fastq.gz
CEG900-123-679_S2_L001_R1_001.fastq.gz  CEG900-123-679_S2_L001_R2_001.fastq.gz
CEG900-123-679_S2_L002_R1_001.fastq.gz  CEG900-123-679_S2_L002_R2_001.fastq.gz
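
As an illustration of the tokenisation described above (a sketch, not pipeline code), splitting one of the example filenames on underscores yields the fields the pipeline relies on:

  # Split an example filename on underscores and pick out the fields the pipeline uses
  fname="CEG900-123-678_S1_L001_R1_001.fastq.gz"
  IFS='_' read -r sample_id sample_number lane read_end suffix <<< "${fname%.fastq.gz}"
  echo "sample_id=${sample_id} lane=${lane} read=${read_end}"
  # prints: sample_id=CEG900-123-678 lane=L001 read=R1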

FASTQ sequence identifier requirements

The format of the sequence identifiers in FASTQ files can differ depending upon the software that was used to generate them. The pipeline requires the read identifiers in FASTQ files to comply with the format of the following example:

@A00536:706:HJNLHDRX3:1:2101:2718:1031 1:N:0:ACGGAACA+ACGAGAAC
Example data        Description
A00536              Instrument name
706                 Run ID
HJNLHDRX3           Flowcell ID
1                   Flowcell lane
2101                Tile number
2718                x-coordinate
1031                y-coordinate
1                   Member of a pair
N                   Filtered
ACGGAACA+ACGAGAAC   Index sequences
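
To check that your own files follow this format before launching the pipeline, you can inspect the first sequence identifier in each FASTQ file, for example (filename illustrative):

  # Print the sequence identifier of the first read in a gzipped FASTQ file
  zcat CEG900-123-678_S1_L001_R1_001.fastq.gz | head -n 1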

Use of the inferred instrument type

The pipeline will infer the instrument type from the instrument name extracted from the first read in each FASTQ file based on the following convention:

Read ID begins with   Instrument inferred as
@A                    NovaSeq6000
@LH                   NovaSeqX
@VL                   NextSeq1000
@VH                   NextSeq2000
Anything else         Unknown

If the instrument type can be inferred, then an empirical q-table specific to that instrument type will be used to resolve Phred quality scores during the process of read resolution. If no instrument type can be inferred, then Phred quality scores will be resolved by selecting the lesser of the two associated Phred quality scores.
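
The following sketch illustrates the inference convention in the table above (it is not the pipeline's own code, and the filename is illustrative):

  # Read the first sequence identifier and map its prefix to an instrument type
  first_id=$(zcat CEG900-123-678_S1_L001_R1_001.fastq.gz | head -n 1)
  case "$first_id" in
    @LH*) echo "NovaSeqX"    ;;
    @VL*) echo "NextSeq1000" ;;
    @VH*) echo "NextSeq2000" ;;
    @A*)  echo "NovaSeq6000" ;;
    *)    echo "Unknown"     ;;
  esac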

Use of the extracted flowcell ID and flowcell lane

The sample ID extracted from the FASTQ filename and the flowcell ID and flowcell lane extracted from the read ID of the first read in the FASTQ file will be used to construct a Read Group which will get passed into the aligner during the alignment step. The following example shows the Read Group generated for sample ID CEG900-123-678, flowcell ID HJNLHDRX3 and flowcell lane 1:

@RG  ID:HJNLHDRX3.1  PL:ILLUMINA  PU:HJNLHDRX3.1.CEG900-123-678  LB:CEG900-123-678  SM:CEG900-123-678
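
The following is a sketch (not the pipeline's own code) of how those three values combine into the Read Group fields shown above:

  # Build the @RG line from the sample ID, flowcell ID and flowcell lane
  sample_id="CEG900-123-678"; flowcell="HJNLHDRX3"; lane="1"
  printf '@RG\tID:%s.%s\tPL:ILLUMINA\tPU:%s.%s.%s\tLB:%s\tSM:%s\n' \
    "$flowcell" "$lane" "$flowcell" "$lane" "$sample_id" "$sample_id" "$sample_id"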

Sequencing run metadata file

The pipeline can use an optional metadata file in csv format, with one column per sample. The contents of cell A1 should be the text “sample_id”; the remaining cells in row 1 should contain the sample IDs; the remaining cells in column A should contain metadata field names (there are no requirements on what these field names are, or how many there are). The pipeline assumes that this file is encoded in ASCII and that fields and field names do not contain any commas.
Here is an example of the content of a sequencing run metadata file that meets the requirements:

sample_id,CEG900-123-678,CEG900-123-679,CEG900-123-680,CEG900-123-681
sample-condition,case,control,case,control
sample-sex,male,male,female,female
lab-technician,JM,JM,ST,ST
project-id,X001,X001,X002,X002

The metadata file must be a csv file located in the --input-path folder as described in the “File structure” section above. The filename can be anything, but the --meta-file argument must exactly match the name of the metadata file. Please provide only the metadata filename, not the full path.
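
A quick way to sanity-check the file before launching the pipeline is to confirm that the first field of the first row is sample_id (a minimal sketch; the path and filename are illustrative):

  head -n 1 /biomodal/data_bucket/test-runxyz/CEGX_Run_meta.csv | cut -d, -f1
  # expected output: sample_id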

(back to main documentation) | (back to top)

2.3. Shell session persistence

We strongly recommend using a persistent shell session manager such as tmux to avoid analysis runs timing out.
If you launch the analysis inside a tmux session, you can reconnect to it if your network connection is lost.
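
For example (the session name is illustrative):

  # Start a named tmux session and launch the analysis inside it; detach with Ctrl-b d
  tmux new -s biomodal_run
  # ...later, after a dropped connection, reattach to the same session
  tmux attach -t biomodal_run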

(back to main documentation) | (back to top)

2.4. Expected output

The following is a (truncated) example of what the output looks like after the biomodal analyse command has been executed and Nextflow has started to run the pipeline:

 N E X T F L O W   ~  version 24.04.2

Launching `/biomodal/biomodal-duet/main.nf` [trusting_magritte] DSL2 - revision: 9c351f91dd
executor >  google-batch (3)
[-        ] resolve_align:FASTQC_RAW                        -
[34/e1230c] resolve_align:CUTADAPT (CEG9330132-19-01, L001) [100%] 1 of 1 ✔
[-        ] resolve_align:FASTQC_TRIMMED                    -
[e8/bf921e] resolve_align:COUPLET (CEG9330132-19-01, L001)  [  0%] 0 of 1
[-        ] resolve_align:FASTQC_RESOLVED                   -
[-        ] resolve_align:BWA_MEM2                          -
...
...

2.4.1. Output location structure overview

Outputs from the Nextflow pipeline get published to your cloud storage bucket or output directory, organised into a directory structure.

At the top level, in your output location, you will have a directory for each sequencing run or dataset that you process, for example:

gs://my-org-nextflow/test-runxyz

At the next level down, Nextflow files are organised into subdirectories serving the following purposes:

Subdirectory   Purpose

nf-input     This is where the pipeline will look for input FASTQ files. FASTQ files should be copied into this directory before launching the pipeline. Please note that all FASTQ files undergo analysis, regardless of the sample information in the meta file.

nf-work      This is the working directory for Nextflow, where logs and files staged for downstream processes are stored. At the next level down there is a directory whose name includes the pipeline version and the tag you set when launching the pipeline; beneath that, the directory structure consists of hashes which match those displayed in Nextflow Tower and which uniquely identify specific jobs launched by Nextflow, for example: gs://my-org-nextflow/test-runxyz/nf-work/duet-1-0.1.0_2021-05-15_1656_5bp/01/af2b9a7434a6ca435b96c6b84cb9a2. This directory is useful for debugging the pipeline, examining logs and viewing the STDOUT from jobs; inside it there are subdirectories associated with each pipeline execution. If the pipeline is running smoothly on your runs, you will rarely need to look in nf-work. If a pipeline run has completed successfully and you have no intention of resuming it with modified settings or examining its logs, you can delete the contents of the associated subdirectory inside nf-work.

nf-results   This is the directory where the outputs of the pipeline are stored, organised by pipeline run, sample, pipeline module and lane. This directory is described further below.

2.4.2. Exploring the nf-results subdirectory

The biomodal pipeline organises data into the following top-level directory structure in the nf-results subdirectory:

Subdirectory     Contents

reports          Sample-level and multi-sample reports summarising information about the samples and controls.

sample_outputs   Primary data files generated by the pipeline (described in more detail below).

controls         BAM files and quantification files associated with the methylated lambda and unmethylated pUC19 controls. These small files are analogous to the BAM files and quantification files generated for your samples, and may be useful for familiarising yourself with the file formats. Note that there is an accompanying FASTA file for the controls in the reference file directory with the following name/location: ss_ctrls/v24/ss-ctrls-long-v24.fa.gz

diagnostics      Secondary outputs from the pipeline, including a parameters log recording the parameters that were used to execute the pipeline, more extensive metrics to support more detailed data investigations, and the interim resolved FASTQ files that were passed into the aligner.

2.4.3. Key pipeline outputs

  • In the reports/summary_reports subdirectory, you’ll find an aggregated summary report in Excel format, which collates metrics from across all modules in the pipeline. Its Pipeline_Summary sheet contains the most useful metrics output by the pipeline.
  • In the reports/sample_reports subdirectory you’ll find a .html report for each sample containing assay-specific quality-related plots.
  • Deduplicated genome-aligned BAM files are located in the sample_outputs/bams subdirectory.
  • Stranded CpG methylation quantification files are located in the sample_outputs/modc_quantification subdirectory and have the file extension *.CG_quant.modc.bed.gz (see the example after this list).
  • Variant calling outputs are in the sample_outputs/variant_call_files subdirectory in .vcf format.
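
For example, to locate the stranded CpG quantification files after a completed run (a sketch; the output path matches the local example in section 2.5.1 and will differ for your runs):

  find /biomodal/data_bucket/test-runxyz/nf-results -name "*.CG_quant.modc.bed.gz"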

For a more detailed overview of the duet pipeline outputs and explanation of the formats, please see the Bioinformatics data interpretation guide.

2.4.4. Removing data from the temporary pipeline folders

The temporary pipeline folders created by the duet pipeline can be removed upon successful completion of the pipeline run. If a pipeline has completed successfully and you have no intention of resuming it with modified settings or examining logs, you can delete the contents of the associated subdirectory inside the nf-work directory. Please see 2.4.1. Output location structure overview for more details about the pipeline folder structure.
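
For example, for a run that wrote its working files to the GCS location shown in section 2.4.1 (a sketch; the bucket path and run directory are illustrative):

  # Remove the working directory of a successfully completed run to free up storage
  gsutil -m rm -r gs://my-org-nextflow/test-runxyz/nf-work/duet-1-0.1.0_2021-05-15_1656_5bp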

(back to main documentation) | (back to top)

2.5. Examples of biomodal analyse commands

Here we have included a few examples of how you can run the biomodal analyse command with different parameters.

2.5.1. Running the test XYZ dataset on a local single node

biomodal analyse \
  --input-path /biomodal/data_bucket/test-runxyz \
  --meta-file CEGX_Run_meta.csv \
  --output-path /biomodal/data_bucket/test-runxyz/nf-results \
  --run-name CEGX_RunXYZ \
  --tag CEGX_RunXYZ \
  --additional-params lib_prefix=CEG9330132-19-01_S12_L001 \
  --mode 5bp 

2.5.2. Running the test XYZ dataset on a local single node with the local_deep_seq profile, resume, and a maximum of 16 CPUs and 32 GB of memory

biomodal analyse \
  --input-path /biomodal/data_bucket/test-runxyz \
  --meta-file CEGX_Run_meta.csv \
  --output-path /biomodal/data_bucket/test-runxyz/nf-results \
  --run-name CEGX_RunXYZ_deep_seq \
  --tag CEGX_RunXYZ_deep_seq \
  --mode 5bp \
  --additional-params cpu_num=16,memory="32GB" \
  --additional-profile local_deep_seq \
  --resume

2.5.3. Running on GCP Batch with targeted mode enabled using the twist_cancer panel

biomodal analyse \
  --input-path gs://gcp-bucket/my_data \
  --meta-file CEGX_Run_meta.csv \
  --output-path gs://gcp-bucket/my_data \
  --tag GCP_targeted_test \
  --run-name GCP_targeted_test \
  --reference-genome GRCh38Decoy \
  --targeted true \
  --targeted-panel twist_cancer \
  --additional-profile deep_seq \
  --additional-params umi=true,r1_r2_switch=true \
  --mode 5bp

(back to main documentation) | (back to top) | (Next)
