Table of Contents
- 2.1. Command Structure and Usage
- 2.2. Input file requirements
- 2.3. Shell session persistence
- 2.4. Expected output
- 2.5. Examples of biomodal analyse commands
Now that you have confirmed the biomodal CLI runs as expected with the test data set, you are ready to use your own data sets. You can create more buckets or transfer your data to the bucket/location created above. We recommend making a folder for your data outside of the test data location used during setup.

To run the duet pipeline on your own sample FASTQ files, use the `biomodal analyse` command with parameters pointing to your own input and output locations.

Please do not attempt to run `biomodal test` or `biomodal analyse` if you experienced any problems during the bootstrapping or `biomodal init` process. `biomodal init` must always complete successfully after a fresh install and after any manual update of the CLI and/or duet pipeline.

Please review the Elevated resource profiles section to ensure you allocate sufficient hardware resources relative to the number of reads per sample before running the duet pipeline.
2.1. Command Structure and Usage
Usage: Simply invoke the `biomodal` CLI without parameters to see all available command options.
```
biomodal v1.1.3

usage: biomodal [SUBCOMMAND] [OPTION]...

biomodal command line interface tools.

SUBCOMMANDS:
  auth                  Login to biomodal
  init                  Download current default version of the duet pipeline, docker images and reference data
  test                  Run the duet analysis pipeline to validate technical setup only
  info                  Display latest online versions and current local versions of duet software and reference data
  list                  List duet pipeline versions available for download
  download <version>    Download a given version of duet pipeline, with related Docker images and reference data
  analyse               Analyse your samples using duet pipeline
    --input-path <input_path>               Required: custom bucket url to directory containing all fastq.gz files to be analysed
    --output-path <output_path>             Required: custom output disk location or bucket url
    --tag <tag>                             Required: custom tag to identify this analysis run
    --meta-file <meta_file_name>            Optional: name of a custom meta file residing in the 'input-path' directory, e.g. 'CEGX_Run_meta.csv'
    --additional-profile <*_seq>            Optional: additional configuration profile to use, e.g. 'deep_seq' or 'super_seq'
    --run-name <run_name>                   Optional: custom run name to identify this analysis run
    --targeted <true|false>                 Optional: enable/disable the pipeline in targeted mode, default is 'false'
    --targeted-panel <panel name>           Optional: targeted panel 'twist_methylome' or 'twist_cancer' (NB! 'targeted' must be 'true')
    --compute-asm <true|false>              Optional: enable/disable Allele Specific Methylation (ASM), default is 'false'
    --chg-chh-contexts <true|false>         Optional: enable/disable CHG and CHH modification calling, default is 'false'
    --use-gvcf <true|false>                 Optional: enable/disable joint variant calling and joint ASM calling, default is 'false'
    --mode <5bp|6bp>                        Optional: 5 base vs 6 base mode, default is '5bp'
    --reference-genome                      Optional: use an alternative reference genome
    --reference-genome-profile              Optional: full path to bespoke reference genome profile you have created using the 'reference make' command
    --quantification-output <quant_output>  Optional: output format for quantification files, combination of 'bedmethyl,cxreport,bismark,bedgraph', default is 'cxreport'
    --additional-params <params>            Optional: additional comma separated parameters, e.g. 'param1=value1,param2=value2'
    --resume                                Optional: resume previous run using successfully completed tasks, default is 'false'
    --work-dir                              Optional: override the default path where the workflow temporary data is stored
  validate              Validate your parameters before you run the duet pipeline (analyse command)
    Parameters: Same as for the 'analyse' command listed above
  call_dmr              Run D(h)MR analysis
    --input-path <input_path>               Required: full custom bucket uri or directory path with Zarr stores to be analysed
    --dmr-sample-sheet <sample_sheet_name>  Required: full path and name of the DMR sample sheet
    --output-path <output_path>             Required: custom output disk location or bucket uri
    --tag <tag>                             Required: custom tag to identify this DMR run
    --mode <5bp|6bp>                        Required: 5 base vs 6 base mode, default is '5bp'
    --condition <condition(s)>              Required: a single DMR sample sheet column that contains the conditions between which to call DMRs
    --covariates <covariates>               Optional: one or more DMR sample sheet columns that contain covariates to account for during DMR calling
    --dmr-bed-path <dmr_bed_path>           Optional: a path to a bed file defining regions that DMR calling should be restricted to
    --evoc-modifications <mc|hmc|modc>      Optional: single or comma separated list of duet evoC modifications to call, 'mc' or 'hmc' in 6bp mode only
    --min-depth <min_depth>                 Optional: contexts will only be removed if coverage <= min-depth in ALL SAMPLES, default is '0'
    --window-size <window_size>             Optional: window size for DMR analysis, default is '1000'
    --additional-params <params>            Optional: additional single or comma separated parameters, e.g. 'param1=value1,param2=value2'
    --run-name <run_name>                   Optional: custom run name to identify this analysis run
    --resume                                Optional: resume previous run using successfully completed tasks, default is 'false'
    --work-dir                              Optional: override the default path where the workflow temporary data is stored
  reference list                          List biomodal reference data versions available for download
  reference download <version>            Download a version of biomodal reference data
  reference pipeline list                 List biomodal reference pipeline versions available for download
  reference pipeline download <version>   Download the biomodal reference pipeline to run locally
  reference make                          Make new alternative reference genome
    --input-path <ref_genome_path>          Required: full path for the reference genome gzipped FASTA file
    --output-path <ref_dir>                 Required: custom output disk full path or bucket url
    --species <species>                     Required: reference species, e.g. 'Homo_sapiens', 'Mus_musculus'
    --reference-genome <ref_genome>         Required: Genome Reference Consortium official name, e.g. 'GRCh38Decoy', 'GRCm38p6'
  report <csv report> <params.json>       Send a specific duet metrics report to biomodal
```
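Because the `validate` subcommand accepts the same parameters as `analyse`, it can be used as a quick sanity check before committing compute. A minimal sketch (the bucket paths and tag below are placeholders, not real locations):

```bash
# Check an analyse configuration without launching the full pipeline.
# All paths and the tag are illustrative.
biomodal validate \
  --input-path gs://my-org-bucket/my-data \
  --output-path gs://my-org-bucket/my-data/nf-results \
  --tag my_first_run \
  --mode 5bp
```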
(back to main documentation) | (back to top)
2.2. Input file requirements
File structure
The `--input-path` folder provided for the duet pipeline must contain the `nf-input` folder and the meta csv file, as per this example:

```
my_input_path
├── biomodal_run_meta.csv
└── nf-input
    ├── CEG93-01_S12_L001_R1_001.fastq.gz
    └── CEG93-01_S12_L001_R2_001.fastq.gz
```
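Staging a new run on a local disk might therefore look like the following (the directory and FASTQ source paths here are illustrative; the meta filename matches the example layout above):

```bash
# Create the expected layout and stage inputs before launching the pipeline.
# All paths are illustrative placeholders.
mkdir -p /biomodal/data_bucket/my_run/nf-input
cp /path/to/my_fastqs/*.fastq.gz /biomodal/data_bucket/my_run/nf-input/
cp /path/to/biomodal_run_meta.csv /biomodal/data_bucket/my_run/
```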
For the D(h)MR workflow, please note that the `--input-path` parameter should point to the full path of the relevant Zarr stores.
FASTQ files
Filename requirements
The pipeline requires gzipped lane-wise (not lane-merged) FASTQ files with filenames that satisfy the naming convention used in BaseSpace, i.e.

```
{sample-id}_{sample-number}_{lane}_{R1|R2}_001.fastq.gz
```
The pipeline makes use of underscores as delimiters separating fields in the filename, so it is important that underscores are not used inside any fields, such as the sample IDs (note that BaseSpace automatically converts underscores in sample IDs into dashes). Please also make sure your sample IDs do not start with a number.
Note that the pipeline disregards the `{sample-number}` field (which is usually allocated automatically by BaseSpace based on the order of samples in your sample sheet); however, the sample number field must be present in the filenames because the pipeline tokenises the filenames using underscores as delimiters and then determines the `sample-id`, `lane` and `R1/R2` as the first, third and fourth fields in the filename. The sketch below illustrates this tokenisation.
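A minimal illustration of how those fields fall out of the filename (plain shell string splitting, not the pipeline's actual code):

```bash
# Split a FASTQ filename on underscores and pick fields 1, 3 and 4.
fname="CEG900-123-678_S1_L001_R1_001.fastq.gz"
IFS='_' read -r sample_id sample_number lane read_mate rest <<< "$fname"
echo "sample-id=$sample_id lane=$lane mate=$read_mate"
# -> sample-id=CEG900-123-678 lane=L001 mate=R1
```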
Here is an example of a set of FASTQ filenames that meets the naming convention requirements (in this case, two samples, each pooled across two lanes):

```
CEG900-123-678_S1_L001_R1_001.fastq.gz  CEG900-123-678_S1_L001_R2_001.fastq.gz
CEG900-123-678_S1_L002_R1_001.fastq.gz  CEG900-123-678_S1_L002_R2_001.fastq.gz
CEG900-123-679_S2_L001_R1_001.fastq.gz  CEG900-123-679_S2_L001_R2_001.fastq.gz
CEG900-123-679_S2_L002_R1_001.fastq.gz  CEG900-123-679_S2_L002_R2_001.fastq.gz
```
FASTQ sequence identifier requirements
The format of the sequence identifiers in FASTQ files can differ depending upon the software that was used to generate them. The pipeline requires the read identifiers in FASTQ files to comply with the format of the following example:

```
@A00536:706:HJNLHDRX3:1:2101:2718:1031 1:N:0:ACGGAACA+ACGAGAAC
```
| Example Data | Description |
|---|---|
| A00536 | Instrument name |
| 706 | Run ID |
| HJNLHDRX3 | Flowcell ID |
| 1 | Flowcell lane |
| 2101 | Tile number |
| 2718 | x-coordinate |
| 1031 | y-coordinate |
| 1 | Member of a pair |
| N | Filtered |
| ACGGAACA+ACGAGAAC | Index sequences |
Use of the inferred instrument type
The pipeline will infer the instrument type from the instrument name extracted from the first read in each FASTQ file based on the following convention:
| Read ID begins with | Instrument inferred as |
|---|---|
| @A | NovaSeq6000 |
| @LH | NovaSeqX |
| @VL | NextSeq1000 |
| @VH | NextSeq2000 |
| Anything else | Unknown |
If the instrument type can be inferred, then an empirical q-table specific to that instrument type will be used to resolve Phred quality scores during the process of read resolution. If no instrument type can be inferred, then Phred quality scores will be resolved by selecting the lesser of the two associated Phred quality scores.
Use of the extracted flowcell ID and flowcell lane
The sample ID extracted from the FASTQ filename, together with the flowcell ID and flowcell lane extracted from the read ID of the first read in the FASTQ file, will be used to construct a Read Group which is passed to the aligner during the alignment step. The following example shows the Read Group generated for sample ID `CEG900-123-678`, flowcell ID `HJNLHDRX3` and flowcell lane `1`:

```
@RG ID:HJNLHDRX3.1 PL:ILLUMINA PU:HJNLHDRX3.1.CEG900-123-678 LB:CEG900-123-678 SM:CEG900-123-678
```
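If you want to confirm the Read Group recorded in an output BAM, one option (assuming `samtools` is available on your system; the BAM filename is illustrative) is:

```bash
# Print the @RG header line from an aligned BAM.
samtools view -H CEG900-123-678.bam | grep '^@RG'
```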
Sequencing run metadata file
The pipeline can use an optional metadata file in csv format, with one column per sample. The contents of cell A1 should be the text `sample_id`; the remaining cells in row 1 should contain sample IDs; the remaining cells in column A should contain metadata field names (there are no requirements about what these field names are, nor how many of them there are). The pipeline assumes that this file is encoded in ASCII and that fields and field names do not contain any commas.
Here is an example of the content of a sequencing run metadata file that meets the requirements:

```
sample_id,CEG900-123-678,CEG900-123-679,CEG900-123-680,CEG900-123-681
sample-condition,case,control,case,control
sample-sex,male,male,female,female
lab-technician,JM,JM,ST,ST
project-id,X001,X001,X002,X002
```
The metadata file must be a csv file located in the `--input-path` folder as described in the “File structure” section above. The filename can be anything, but the `--meta-file` argument must match the name of the metadata file exactly. Please pass only the metadata filename, not the full path.
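Two quick checks on the metadata file before launching can save a failed run (these are generic shell commands, not part of the biomodal CLI; the filename matches the earlier example):

```bash
# Confirm the encoding and the required 'sample_id' header cell.
file biomodal_run_meta.csv                       # should report ASCII text
head -n 1 biomodal_run_meta.csv | cut -d',' -f1  # should print: sample_id
```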
(back to main documentation) | (back to top)
2.3. Shell session persistence
We strongly recommend using a persistent shell session such as tmux to avoid analysis runs timing out. If you run the pipeline inside `tmux`, you can reconnect to the session if the network connection is lost.
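For example, a typical tmux workflow might look like this (the session name is an illustrative choice):

```bash
# Start a named session and launch the analysis inside it.
tmux new -s duet_run
# ... run 'biomodal analyse ...' inside the session ...
# Detach with Ctrl-b then d; the run keeps going on the server.

# Reattach later, even after a dropped connection.
tmux attach -t duet_run
```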
(back to main documentation) | (back to top)
2.4. Expected output
The following is a (truncated) example of what the output looks like after the `biomodal analyse` command has been executed and Nextflow has started to run the pipeline:

```
N E X T F L O W  ~  version 24.04.2
Launching `/biomodal/biomodal-duet/main.nf` [trusting_magritte] DSL2 - revision: 9c351f91dd
executor >  google-batch (3)
[-        ] resolve_align:FASTQC_RAW      -
[34/e1230c] resolve_align:CUTADAPT (CEG9330132-19-01, L001) [100%] 1 of 1 ✔
[-        ] resolve_align:FASTQC_TRIMMED  -
[e8/bf921e] resolve_align:COUPLET (CEG9330132-19-01, L001)  [  0%] 0 of 1
[-        ] resolve_align:FASTQC_RESOLVED -
[-        ] resolve_align:BWA_MEM2        -
...
...
```
2.4.1. Output location structure overview
Outputs from the Nextflow pipeline are published to your cloud storage bucket or output directory, organised into a directory structure.

At the top level, in your output location, you will have a directory for each sequencing run or dataset that you process, for example:

```
gs://my-org-nextflow/test-runxyz
```

At the next level down, Nextflow files are organised into subdirectories serving the following purposes:
| Subdirectory | Purpose |
|---|---|
| nf-input | This is where the pipeline will look for input FASTQ files. FASTQ files should be copied into this directory prior to launching the pipeline. Please note that all FASTQ files undergo analysis regardless of sample info in the meta file. |
| nf-work | This is the working directory for Nextflow. Logs and files staged for downstream processes are stored here. At the next level down, there will be a directory that includes the pipeline version and the tags that you set when launching the pipeline. Beneath that, the directory structure comprises hashes which match those displayed in Nextflow Tower and which uniquely identify specific jobs launched by Nextflow. For example: `gs://my-org-nextflow/test-runxyz/nf-work/duet-1-0.1.0_2021-05-15_1656_5bp/01/af2b9a7434a6ca435b96c6b84cb9a2` This directory is useful for debugging the pipeline, examining logs and viewing the STDOUT from jobs. Inside this directory there will be subdirectories associated with each pipeline execution. If the pipeline is running smoothly on your runs, you will rarely need to look in the nf-work directory. If a pipeline has completed successfully and you have no intention of resuming it with modified settings or examining logs, you can delete the contents of the associated subdirectory inside the nf-work directory. |
| nf-results | This is the directory where the outputs of the pipeline are stored, organised by pipeline run, sample, pipeline module and lane. This directory is described further below. |
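To orient yourself after a run, listing the top level of the output location (here with `gsutil`, assuming a GCS bucket; the path reuses the example above) should show these subdirectories:

```bash
# List the run directory; expect nf-input/, nf-work/ and nf-results/.
gsutil ls gs://my-org-nextflow/test-runxyz/
```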
2.4.2. Exploring the nf-results subdirectory
The biomodal pipeline organises data into the following top-level directory structure in the `nf-results` subdirectory:
| Subdirectory | Contents |
|---|---|
| reports | Sample-level and multi-sample reports summarising information about the samples and controls |
| sample_outputs | Primary data files generated by the pipeline (described in more detail below) |
| controls | BAM files and quantification files associated with the methylated lambda and unmethylated pUC19 controls. These small files are analogous to the BAM files and quantification files generated for your samples, and may be useful for familiarising yourself with the file formats. Note that there is an accompanying FASTA file for the controls in the reference file directory with the following name/location: `ss_ctrls/v24/ss-ctrls-long-v24.fa.gz` |
| diagnostics | Secondary outputs from the pipeline, including a parameters log recording the parameters that were used to execute the pipeline, more extensive metrics to support more detailed data investigations, and the interim resolved FASTQ files that were passed into the aligner |
2.4.3. Key pipeline outputs
- In the `reports/summary_reports` subdirectory, you’ll find an aggregated summary report in Excel format, which collates metrics from across all modules in the pipeline. This file has the `Pipeline_Summary` sheet containing the most useful metrics output by the pipeline.
- In the `reports/sample_reports` subdirectory, you’ll find a .html report for each sample containing assay-specific quality-related plots.
- Deduplicated genome-aligned BAM files are located in the `sample_outputs/bams` subdirectory.
- Stranded CpG methylation quantification files are located in the `sample_outputs/modc_quantification` subdirectory and have the file extension `*.CG_quant.modc.bed.gz` (see the example after this list).
- Variant calling outputs are in the `sample_outputs/variant_call_files` subdirectory in .vcf format.
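As an example, the first records of a stranded CpG quantification file can be inspected directly from the command line (the sample name and relative path below are illustrative):

```bash
# Peek at the first few rows of a gzipped CpG quantification file.
zcat nf-results/sample_outputs/modc_quantification/CEG900-123-678.CG_quant.modc.bed.gz | head
```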
For a more detailed overview of the duet pipeline outputs and explanation of the formats, please see the Bioinformatics data interpretation guide.
2.4.4. Removing data from the temporary pipeline folders
The temporary pipeline folders created by the duet pipeline can be removed upon successful completion of the pipeline run. If a pipeline has completed successfully and you have no intention of resuming it with modified settings or examining logs, you can delete the contents of the associated subdirectory inside the `nf-work` directory. Please see 2.4.1. Output location structure overview for more details about the pipeline folder structure.
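For a GCS-backed run, that cleanup might look like the following (the bucket path reuses the illustrative example from section 2.4.1; double-check the path before deleting):

```bash
# Remove the work files for one completed pipeline execution.
gsutil -m rm -r gs://my-org-nextflow/test-runxyz/nf-work/duet-1-0.1.0_2021-05-15_1656_5bp/
```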
(back to main documentation) | (back to top)
2.5. Examples of biomodal analyse commands
Here we have included a few examples of how you can run the `biomodal analyse` command with different parameters.
2.5.1. Running the test XYZ dataset on a local single node
```bash
biomodal analyse \
  --input-path /biomodal/data_bucket/test-runxyz \
  --meta-file CEGX_Run_meta.csv \
  --output-path /biomodal/data_bucket/test-runxyz/nf-results \
  --run-name CEGX_RunXYZ \
  --tag CEGX_RunXYZ \
  --additional-params lib_prefix=CEG9330132-19-01_S12_L001 \
  --mode 5bp
```
2.5.2. Running the test XYZ dataset on a local single node with the local_deep_seq profile, resume, and a maximum of 16 CPUs and 32 GB of memory
```bash
biomodal analyse \
  --input-path /biomodal/data_bucket/test-runxyz \
  --meta-file CEGX_Run_meta.csv \
  --output-path /biomodal/data_bucket/test-runxyz/nf-results \
  --run-name CEGX_RunXYZ_deep_seq \
  --tag CEGX_RunXYZ_deep_seq \
  --mode 5bp \
  --additional-params cpu_num=16,memory="32GB" \
  --additional-profile local_deep_seq \
  --resume
```
2.5.3. Running on GCP Batch with targeted mode enabled using the twist_cancer panel
```bash
biomodal analyse \
  --input-path gs://gcp-bucket/my_data \
  --meta-file CEGX_Run_meta.csv \
  --output-path gs://gcp-bucket/my_data \
  --tag GCP_targeted_test \
  --run-name GCP_targeted_test \
  --reference-genome GRCh38Decoy \
  --targeted true \
  --targeted-panel twist_cancer \
  --additional-profile deep_seq \
  --additional-params umi=true,r1_r2_switch=true \
  --mode 5bp
```