Advanced-features

Static Badge Static Badge Static Badge

Table of Contents

(back to main documentation)

3.1. CHG and CHH modification calling

By default, the pipeline calls modified cytosines in CpG context only.
However, the pipeline includes the capability to call modified cytosines at CHG and CHH sites (where H is the IUPAC disambiguation code representing a DNA nucleotide other than G). To invoke CHG and CHH modification calling, add the following flag when launching the pipeline:

--chg-chh-contexts true

(back to main documentation) | (back to top)

3.2. Mapping Quality Filtering

By default, the duet pipeline discards reads from the BAM file with a MAPQ (Mapping Quality) score of 0. However, you can adjust the stringency of this filter as needed.

To include all reads regardless of mapping quality:

--additional-params bamlet.min_map_qual=0

To retain only reads in the BAM with MAPQ >=20, thereby discarding lower-confidence mappings:

--additional-params bamlet.min_map_qual=20

The pipeline also supports more stringent MAPQ filtering specifically for reads used in epigenetic quantification. For example, to keep reads in the BAM with MAPQ >=1 but restrict epigenetic quantification to reads with MAPQ >=20:

--additional-params bamlet.min_map_qual=1,epiquant.min_map_qual=20

(back to main documentation) | (back to top)

3.3. Epigenetic Quantification Options

3.3.1. Epigenetic Quantification Output Formats

In addition to the zarr store, the duet pipeline is able to produce four types of plain-text epigenetic quantification outputs; bedmethyl, cxreport, bismark or bedgraph. The default is to produce cxreport, but any combination can be requested. To produce cxreport and bismark:

--quantification-output cxreport,bismark

3.3.2. Limit epigenetic quantification to primary chromosomes

By default, the duet pipeline will quantify epigenetics at each CG context present on contigs included in the primary assembly BED (if the species has an available primary assembly BED). In GRCh38 Human reference this includes unlocalised scaffolds (e.g. chr1_KI270706v1_random). To exclude these scaffolds from epigenetic quantification (note they will still be present in the BAM):

--additional-params epiquant.primary_chromosomes_only=true

3.3.3. Include SNVs in CG coverage

By default, the pipeline counts total coverage at a reference CG position as the number of Cs (modified or unmodified) at a given position. However, you can optionally include A, G and T base calls in total coverage:

--additional-params modality.export.only_c_in_coverage=false

Note the total coverage in methylation quantification outputs will be used as the denominator to calculate percenatge methylated/hydroxymethylated.

3.3.4. Output uncovered CG contexts

By default any reference CG contexts which are uncovered in a given sample will be omitted from the plain-text quantification outputs. To include zero-coverage CG contexts in quantification outputs:

--additional-params modality.export.skip_rows_zero_counts=false

(back to main documentation) | (back to top)

3.4. Variant Associated Methylation

The duet pipeline contains a module designed to identify various types of variant methylation, such as Allele Specific Methlation (ASM). See below for further instructions.

(back to main documentation) | (back to top)

3.4.1. Allele Specific Methylation

At sites where there is a heterozygous allele, it is possible to quantify differences in methylation between the two alleles. The pipeline includes this capability, but it is not run by default. This capability can be activated by added the following flag when launching the pipeline:

--compute-asm true

(back to main documentation) | (back to top)

3.4.2. Joint Variant Calling

The pipeline also supports joint variant calling and joint VAM calling. This can be turned on by adding the following parameters when running the pipeline.

--compute-asm true --use-gvcf true

(back to main documentation) | (back to top)

3.4.3. Somatic Variant Calling

The pipeline also supports somatic variant calling which is performed using Mutect2 followed by GATK-recommended filtering using FilterMutectCalls. This is performed in the ‘tumour-only’ mode of Mutect2 (i.e. it is not run using pairing of a tumour / matched normal).

Somatic variant calling can be activated by adding the following parameters when running the pipeline.

--additional-params call_somatic_variants=true

(back to main documentation) | (back to top)

3.5. Reduce publish footprint

The expected published footprint of the pipeline is approximately 1.4X the footprint of input FASTQs. However, this can be reduced by optionally not publishing resolved FASTQs, publishing CRAMs instead of BAMs, or both.

(back to main documentation) | (back to top)

3.5.1. Disable Resolved FASTQ publishing

The duet pipeline ‘resolves’ input deaminated R1/R2 FASTQ file-pairs for a sample into a ‘resolved FASTQ’ file containing trimmed single-end reads with epigenomic tags encoding the modifications present.

This a large file, making up >50% of the published footprint, and is NOT required for downstream analysis. Therefore you may wish to disable publishing of this file.

--additional-params publish_by_default.prelude=false

(back to main documentation) | (back to top)

3.5.2. CRAMs

The duet pipeline outputs sequence alignments as BAM files by default, but offers the option to output genome sequence alignments as CRAM files. CRAMs will be outputted with a reference genome fasta (which is required to read CRAMs).

Outputting CRAM files saves ~40% disk space compared to BAM files.

--additional-params publish_crams=true

(back to main documentation) | (back to top)

3.6. Subsampling

If you wish to subsample the reads in your sample prior to processing (such as to ensure equal numbers of reads in each sample for inter-library comparison purposes), then this is possible by adding the following parameter and indicating the maximum number of reads that you want to retain on a single lane for a single sample:

--additional-params subsample=50000000

Note: This subsampling is performed per sample, per lane, so in the above example, if there were 4 lanes, then this would limit the processing to 50M reads per sample per lane, so 200M reads in total per sample.

(back to main documentation) | (back to top)

3.7. Alternative Supported Reference Genomes

By default the duet pipeline will align against a reference file containing the human reference genome GRCh38 and the sequences of the spike-in controls supplied with the kit.
It is also possible to run the pipeline relative to other available reference genomes (e.g. mouse) first by downloading the preferred reference data from biomodal:

biomodal reference list
biomodal reference download GRCm38p6

Then the following flag can be used when launching the pipeline:

--reference-genome GRCm38p6

Alternatively you can create your own reference genome by following the instructions in the make reference genome documentation.

(back to main documentation) | (back to top)

3.7.1. Supported Mouse References

The duet pipeline supports three versions of the mouse reference; GRCm39, GRCm38p6 and GRCm38.

GRCm38p6 (2017): final patch version of GRCm38, incorporating cumulative patch updates into the primary assembly. Use this reference for studies requiring compatibility with legacy GRCm38 tools and datasets.

GRCm39 (2020): latest mouse genome assembly without legacy patches.

GRCm38 (2012): original GRCm38 with additional patch fix contigs, resulting in a large reference (>12 Gbp) requiring more resource to process data. Only use if your study requires detailed patch-level alignment. Note that reads aligning to contigs which are not present in the primary assembly bed (including patch fix contigs) will be discarded by the duet pipeline.

(back to main documentation) | (back to top)

3.8. Elevated resource profiles

Each module file in the duet pipeline contains default resource settings in relation to CPUs, memory, and disk. These settings are intended to be sufficient for shallow sequencing runs where you have up to 50 million reads per sample per lane. On deeper sequencing runs, you should invoke an elevated resource which will allocate additional resources to jobs. For the deepest sequencing runs in UMI-mode, it’s also necessary to invoke a CLI parameter that will distribute the duplicate identification step across intervals/contigs:

Reads per sample per lane Profiles
Up to 50 million Default, no additional profile required
Up to 500 million --additional-profile deep_seq
Over 500 million --additional-profile super_seq

Invoking any of these profiles will require additional CPU, RAM and disk resources to the duet pipeline. Please see the module requirements document for more information.

If you have very deeply sequenced samples with over 2 billion reads per sample per lane and are using UMIs then you should additionally include the following parameter when launching the duet pipeline: --additional-params dedup_by_contigs=true

(back to main documentation) | (back to top)

3.8.1. Limit local resource requirements

If you have to limit the duet pipeline CPU and RAM resources for the local executor on a single machine, you can use the cpu_num and memory parameters in conjuntion with the local or local_deep_seq profiles.

Please note this only works with the local executor, so this will not have any effect if you are using HPC or cloud based executors.

Below is an example that will use the local_deep_seq profile and additionally limit the pipeline usage to a maximum of 32 CPUs and 128GB RAM:

--additional-profile local_deep_seq --additional-params cpu_num=32,memory="128GB"

(back to main documentation) | (back to top)

3.8.2. duet pipeline module requirements

Please see the module requirements document for a complete list of module requirements for the duet pipeline and D(h)MR workflow.

(back to main documentation) | (back to top)

3.9. Resuming a disrupted pipeline

If a pipeline has been disrupted or aborted, it is possible to resume it from where it left off by executing the same command with the added parameter --resume in the same directory as the previous pipeline. This might be useful if your terminal session is accidentally terminated, such as by a break in network connectivity. It may also be useful if you want to deliberately stop and resume a pipeline, for example if you launched the pipeline outside of a persistent shell session and want to stop it to resume within a persistent shell session.
Please ensure that you use exactly the same parameters as the previous pipeline when resuming.

(back to main documentation) | (back to top)

3.10. Low GCP CPU quotas

If you have a very low concurrent CPU quota in GCP (for instance if you are using a free trial account where the number of consecutive CPUs is limited to 24) then please seek guidance from biomodal and you may need a custom resource profile to accommodate the low quotas.

(back to main documentation) | (back to top)

3.11. Stopping and starting your Orchestration VM

You only need to have your orchestration VM running when you are actively running pipelines. You can start and stop your Orchestration VM from the Compute Engine -> VM instances page of the Google Cloud Console.

(back to main documentation) | (back to top)

3.12. Processing individual samples

Due to resource constraints, there may be a need to process individual samples in the duet pipeline. This will require a smaller amount of CPU and RAM resources as you will analyse each sample separately. You can use the lib_prefix parameter when executing the duet pipeline to specify the prefix of the library to process. You can add the prefix to the --additional-params parameter to your biomodal analyse command as per this example:

--additional-params lib_prefix=SAMPLE1_L001

(back to main documentation) | (back to top)

3.13. Enable generation of the FASTQC report

By default the FASTQ report is not generated in the duet pipeline. To enable generation of the report on your raw FASTQ files, please add the following parameter to biomodal analyse when launching the duet pipeline:

--additional-params run_by_default.fastqc.raw=true

(back to main documentation) | (back to top) | (Next)

Cambridge Epigenetix is now biomodal