Table of Contents
- 3.1. CHG and CHH modification calling
- 3.2. Mapping Quality Filtering
- 3.3. Epigenetic Quantification Options
- 3.4. Variant Calling and Variant Associated Methylation
- 3.5. Reduce publish footprint
- 3.6. Subsampling
- 3.7. Alternative Supported Reference Genomes
- 3.8. Elevated resource profiles
- 3.9. Resuming a disrupted pipeline
- 3.10. Low GCP CPU quotas
- 3.11. Stopping and starting your Orchestration VM
- 3.12. Processing individual samples
- 3.13. Enable generation of the FastQC report
3.1. CHG and CHH modification calling
By default, the pipeline calls modified cytosines in CpG context only.
However, the pipeline includes the capability to call modified cytosines at CHG and CHH sites (where H is the IUPAC ambiguity code representing a DNA nucleotide other than G). To invoke CHG and CHH modification calling, add the following flag when launching the pipeline:
--chg-chh-contexts true
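The three contexts can be illustrated with a short sketch (illustrative only, not pipeline code): given a cytosine and the two reference bases that follow it, the context is CpG if the next base is G, CHG if the base after next is G, and CHH otherwise.

```python
def cytosine_context(triplet: str) -> str:
    """Classify the context of a cytosine from the 3-mer starting at it.

    H is any base other than G (i.e. A, C or T).
    """
    assert len(triplet) == 3 and triplet[0] == "C"
    if triplet[1] == "G":
        return "CpG"
    if triplet[2] == "G":
        return "CHG"
    return "CHH"

print(cytosine_context("CGA"))  # CpG
print(cytosine_context("CAG"))  # CHG
print(cytosine_context("CTT"))  # CHH
```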
3.2. Mapping Quality Filtering
By default, the duet pipeline discards reads from the BAM file with a MAPQ (Mapping Quality) score of 0. However, you can adjust the stringency of this filter as needed.
To include all reads regardless of mapping quality:
--additional-params bamlet.min_map_qual=0
To retain only reads in the BAM with MAPQ >=20, thereby discarding lower-confidence mappings:
--additional-params bamlet.min_map_qual=20
The pipeline also supports more stringent MAPQ filtering specifically for reads used in epigenetic quantification. For example, to keep reads in the BAM with MAPQ >=1 but restrict epigenetic quantification to reads with MAPQ >=20:
--additional-params bamlet.min_map_qual=1,epiquant.min_map_qual=20
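The two thresholds act at different stages. A minimal sketch of the semantics (illustrative only, not the pipeline's implementation):

```python
reads = [
    {"name": "r1", "mapq": 0},
    {"name": "r2", "mapq": 5},
    {"name": "r3", "mapq": 30},
]

bamlet_min_map_qual = 1     # reads below this are dropped from the BAM
epiquant_min_map_qual = 20  # reads below this are ignored for quantification

# Stage 1: the BAM keeps reads passing the bamlet threshold.
in_bam = [r for r in reads if r["mapq"] >= bamlet_min_map_qual]

# Stage 2: epigenetic quantification uses only the higher-confidence subset.
in_epiquant = [r for r in in_bam if r["mapq"] >= epiquant_min_map_qual]

print([r["name"] for r in in_bam])       # ['r2', 'r3']
print([r["name"] for r in in_epiquant])  # ['r3']
```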
3.3. Epigenetic Quantification Options
3.3.1. Epigenetic Quantification Output Formats
In addition to the zarr store, the duet pipeline can produce four types of plain-text epigenetic quantification output: bedmethyl, cxreport, bismark or bedgraph. The default is to produce cxreport, but any combination can be requested. To produce cxreport and bismark:
--quantification-output cxreport,bismark
3.3.2. Limit epigenetic quantification to primary chromosomes
By default, the duet pipeline will quantify epigenetics at each CG context present on contigs included in the primary assembly BED (if the species has an available primary assembly BED). In the GRCh38 human reference this includes unlocalised scaffolds (e.g. chr1_KI270706v1_random). To exclude these scaffolds from epigenetic quantification (note they will still be present in the BAM):
--additional-params epiquant.primary_chromosomes_only=true
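The effect can be sketched as follows. The underscore test below is only an illustrative heuristic for spotting unlocalised scaffolds; the pipeline itself uses the primary assembly BED to decide which contigs to quantify.

```python
contigs = ["chr1", "chr2", "chr1_KI270706v1_random", "chrX"]

# With epiquant.primary_chromosomes_only=true, only primary chromosomes are
# quantified; unlocalised scaffolds remain in the BAM but are skipped here.
primary_only = [c for c in contigs if "_" not in c]
print(primary_only)  # ['chr1', 'chr2', 'chrX']
```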
3.3.3. Include SNVs in CG coverage
By default, the pipeline counts the total coverage at a reference CG position as the number of C base calls (modified or unmodified) at that position. However, you can optionally include A, G and T base calls in the total coverage:
--additional-params modality.export.only_c_in_coverage=false
Note that the total coverage in methylation quantification outputs is used as the denominator when calculating the percentage methylated/hydroxymethylated.
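The effect of the denominator choice can be seen in a small worked example (illustrative numbers):

```python
# Base calls observed at one reference CG position:
modified_c = 8    # methylated C calls
unmodified_c = 2  # unmodified C calls
other_bases = 10  # A/G/T calls (e.g. from a C>T SNV)

# Default (only_c_in_coverage=true): the denominator counts C calls only.
cov_default = modified_c + unmodified_c
pct_default = 100 * modified_c / cov_default
print(pct_default)  # 80.0

# With only_c_in_coverage=false: A/G/T calls enter the denominator too.
cov_all = cov_default + other_bases
pct_all = 100 * modified_c / cov_all
print(pct_all)  # 40.0
```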
3.3.4. Output uncovered CG contexts
By default any reference CG contexts which are uncovered in a given sample will be omitted from the plain-text quantification outputs. To include zero-coverage CG contexts in quantification outputs:
--additional-params modality.export.skip_rows_zero_counts=false
3.4. Variant Calling and Variant Associated Methylation
The duet pipeline contains modes for calling germline variants using GATK HaplotypeCaller and for calling somatic variants using GATK Mutect2.
Additionally, an Allele Specific Methylation (ASM) module can be activated which will evaluate each heterozygous germline variant for the presence of methylation differences between the two alleles.
3.4.1. Germline Variant Calling
By default, germline variant calling is performed on each sample individually using GATK HaplotypeCaller, generating a separate VCF file for each sample.
Germline variant calling can be disabled by setting the following parameter when launching the pipeline:
--additional-params call_germline_variants=false
3.4.2. Joint Variant Calling
The pipeline also supports joint variant calling across multiple samples. Joint variant calling can be activated by adding the following parameter when launching the pipeline:
--use-gvcf true
This will cause the GATK HaplotypeCaller to generate per-sample gVCF files (instead of VCF files) which will then be consolidated via GATK GenomicsDBImport followed by joint genotyping using GATK GenotypeGVCFs.
3.4.3. Allele Specific Methylation
The Allele Specific Methylation (ASM) module can be activated to evaluate each heterozygous germline variant for the presence of methylation differences between the two alleles. The ASM module is disabled by default, but can be activated by adding the following parameter when launching the pipeline:
--compute-asm true
To perform ASM calling, variants will be ‘atomized’ using bcftools norm so that complex variants are decomposed – i.e. MNVs will be split into consecutive SNVs. An ASM file will be generated for each sample.
If the `--compute-asm true` parameter is combined with the `--use-gvcf true` parameter described above, an ASM file will still be generated for each sample, but ASM calling will be performed using the atomized outputs from joint variant calling. A subsequent step will combine per-sample ASM calls into a single file.
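The atomization step can be illustrated with a simplified sketch: an MNV replacing adjacent bases is split into consecutive SNVs. This is only a toy version of what bcftools norm does; real atomization also handles indels and more complex alleles.

```python
def atomize_mnv(pos, ref, alt):
    """Split a multi-nucleotide variant into per-base SNVs.

    Simplified: assumes ref and alt have equal length; positions are 1-based.
    """
    assert len(ref) == len(alt)
    return [
        (pos + i, r, a)
        for i, (r, a) in enumerate(zip(ref, alt))
        if r != a
    ]

print(atomize_mnv(100, "AC", "GT"))  # [(100, 'A', 'G'), (101, 'C', 'T')]
```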
3.4.4. Somatic Variant Calling
The pipeline also supports somatic variant calling which is performed using GATK Mutect2 followed by GATK-recommended filtering using FilterMutectCalls. This is performed in the ‘tumour-only’ mode of Mutect2 (i.e. it is not run using pairing of a tumour / matched normal).
Somatic variant calling can be activated by adding the following parameter when launching the pipeline:
--additional-params call_somatic_variants=true
The filtering of somatic variants using FilterMutectCalls happens by default if somatic variant calling is activated, but filtering can be disabled by adding the following parameter when launching the pipeline:
--additional-params filter_somatic_variants=false
3.5. Reduce publish footprint
The expected published footprint of the pipeline is approximately 1.4X the footprint of input FASTQs. However, this can be reduced by optionally not publishing resolved FASTQs, publishing CRAMs instead of BAMs, or both.
3.5.1. Disable Resolved FASTQ publishing
The duet pipeline ‘resolves’ input deaminated R1/R2 FASTQ file-pairs for a sample into a ‘resolved FASTQ’ file containing trimmed single-end reads with epigenomic tags encoding the modifications present.
This is a large file, making up >50% of the published footprint, and it is NOT required for downstream analysis. Therefore you may wish to disable publishing of this file:
--additional-params publish_by_default.prelude=false
3.5.2. CRAMs
The duet pipeline outputs sequence alignments as BAM files by default, but offers the option to output them as CRAM files instead. CRAMs are output together with a reference genome FASTA (which is required to read CRAMs).
Outputting CRAM files saves ~40% disk space compared to BAM files:
--additional-params publish_crams=true
3.6. Subsampling
If you wish to subsample the reads in your sample prior to processing (such as to ensure equal numbers of reads in each sample for inter-library comparison purposes), then this is possible by adding the following parameter and indicating the maximum number of reads that you want to retain on a single lane for a single sample:
--additional-params subsample=50000000
Note: This subsampling is performed per sample, per lane, so in the above example, if there were 4 lanes, then this would limit the processing to 50M reads per sample per lane, so 200M reads in total per sample.
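The per-lane arithmetic from the note above, as a tiny worked example:

```python
subsample = 50_000_000  # cap on reads per sample per lane
lanes = 4

# The cap applies per lane, so the per-sample total scales with lane count.
max_reads_per_sample = subsample * lanes
print(max_reads_per_sample)  # 200000000 (i.e. 200M reads per sample)
```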
3.7. Alternative Supported Reference Genomes
By default the duet pipeline will align against a reference file containing the human reference genome GRCh38 and the sequences of the spike-in controls supplied with the kit.
It is also possible to run the pipeline relative to other available reference genomes (e.g. mouse) first by downloading the preferred reference data from biomodal:
biomodal reference list
biomodal reference download GRCm38p6
Then the following flag can be used when launching the pipeline:
--reference-genome GRCm38p6
Alternatively you can create your own reference genome by following the instructions in the make reference genome documentation.
3.7.1. Supported Mouse References
The duet pipeline supports three versions of the mouse reference: GRCm39, GRCm38p6 and GRCm38.
- GRCm38p6 (2017): the final patch version of GRCm38, incorporating cumulative patch updates into the primary assembly. Use this reference for studies requiring compatibility with legacy GRCm38 tools and datasets.
- GRCm39 (2020): the latest mouse genome assembly, without legacy patches.
- GRCm38 (2012): the original GRCm38 with additional patch-fix contigs, resulting in a large reference (>12 Gbp) that requires more resource to process data. Only use it if your study requires detailed patch-level alignment. Note that reads aligning to contigs which are not present in the primary assembly BED (including patch-fix contigs) will be discarded by the duet pipeline.
3.8. Elevated resource profiles
Each module file in the duet pipeline contains default resource settings for CPUs, memory, and disk. These settings are intended to be sufficient for shallow sequencing runs with up to 50 million reads per sample per lane. On deeper sequencing runs, you should invoke an elevated resource profile, which will allocate additional resources to jobs. For the deepest sequencing runs in UMI mode, it is also necessary to invoke a CLI parameter that will distribute the duplicate identification step across intervals/contigs:
| Reads per sample per lane | Profile |
| --- | --- |
| Up to 50 million | Default, no additional profile required |
| Up to 500 million | `--additional-profile deep_seq` |
| Over 500 million | `--additional-profile super_seq` |
Invoking any of these profiles allocates additional CPU, RAM and disk resources to the duet pipeline. Please see the module requirements document for more information.
If you have very deeply sequenced samples with over 2 billion reads per sample per lane and are using UMIs, then you should additionally include the following parameter when launching the duet pipeline:
--additional-params dedup_by_contigs=true
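The depth-to-profile mapping above can be summarised as a small helper. This is a hypothetical convenience function for illustration only; it is not part of the biomodal CLI.

```python
def extra_launch_args(reads_per_sample_per_lane, umi_mode=False):
    """Suggest extra duet launch arguments from the depth table (illustrative)."""
    args = []
    if reads_per_sample_per_lane > 500_000_000:
        args += ["--additional-profile", "super_seq"]
    elif reads_per_sample_per_lane > 50_000_000:
        args += ["--additional-profile", "deep_seq"]
    # UMI mode at extreme depth also needs per-contig duplicate identification.
    if umi_mode and reads_per_sample_per_lane > 2_000_000_000:
        args += ["--additional-params", "dedup_by_contigs=true"]
    return args

print(extra_launch_args(40_000_000))   # []
print(extra_launch_args(300_000_000))  # ['--additional-profile', 'deep_seq']
print(extra_launch_args(2_500_000_000, umi_mode=True))
```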
3.8.1. Limit local resource requirements
If you need to limit the duet pipeline's CPU and RAM usage with the local executor on a single machine, you can use the `cpu_num` and `memory` parameters in conjunction with the `local` or `local_deep_seq` profiles.
Please note this only works with the `local` executor, so it will have no effect if you are using HPC or cloud-based executors.
Below is an example that uses the `local_deep_seq` profile and additionally limits pipeline usage to a maximum of 32 CPUs and 128GB RAM:
--additional-profile local_deep_seq --additional-params cpu_num=32,memory="128GB"
3.8.2. duet pipeline module requirements
Please see the module requirements document for a complete list of module requirements for the duet pipeline and D(h)MR workflow.
3.9. Resuming a disrupted pipeline
If a pipeline has been disrupted or aborted, it is possible to resume it from where it left off by executing the same command with the added parameter `--resume`, in the same directory as the previous pipeline. This might be useful if your terminal session is accidentally terminated, such as by a break in network connectivity. It may also be useful if you want to deliberately stop and resume a pipeline, for example if you launched the pipeline outside of a persistent shell session and want to stop it and resume within a persistent shell session.
Please ensure that you use exactly the same parameters as the previous pipeline when resuming.
3.10. Low GCP CPU quotas
If you have a very low concurrent CPU quota in GCP (for instance, if you are using a free trial account where the number of concurrent CPUs is limited to 24), please seek guidance from biomodal; you may need a custom resource profile to accommodate the low quota.
3.11. Stopping and starting your Orchestration VM
You only need to have your Orchestration VM running when you are actively running pipelines. You can start and stop your Orchestration VM from the Compute Engine -> VM instances page of the Google Cloud Console.
3.12. Processing individual samples
Due to resource constraints, there may be a need to process individual samples in the duet pipeline. This requires less CPU and RAM, as each sample is analysed separately. You can use the `lib_prefix` parameter when executing the duet pipeline to specify the prefix of the library to process. Add the prefix via the `--additional-params` parameter of your `biomodal analyse` command, as per this example:
--additional-params lib_prefix=SAMPLE1_L001
3.13. Enable generation of the FastQC report
By default, the FastQC report is not generated by the duet pipeline. To enable generation of the report on your raw FASTQ files, please add the following parameter to `biomodal analyse` when launching the pipeline:
--additional-params run_by_default.fastqc.raw=true