Analyse multimodal data at scale
Providing insight and guidance for future experiments in a single workflow
Our comprehensive software and data analysis package allows you to easily identify methylation profiles, genetic variants, and variant-associated methylation in a single sample or multi-sample cohort.
Core processing
Takes output data directly from your sequencing run and processes it into a combined genetic and epigenetic format ready for further analysis.
Steps
- Read resolution: pairs the original and copy strand sequences from each DNA fragment, and combines them into a single resolved sequence encoding the genetic and epigenetic status at each base.
- Trims any hairpin or sequencing adapter bases using cutadapt
- Quality filtering
Outputs
- Resolved FASTQ files ready for further analysis
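As an illustration of the quality filtering step, here is a minimal sketch that keeps only reads whose mean Phred score meets a threshold. It is not the duet implementation: the function names, the Phred+33 encoding and the cut-off of 20 are all assumptions chosen for the example.

```python
from statistics import mean


def parse_fastq(lines):
    """Yield (name, sequence, quality) tuples from an iterable of FASTQ lines."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = next(it).strip()
        next(it)  # skip the '+' separator line
        qual = next(it).strip()
        yield header.strip()[1:], seq, qual


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return mean(ord(ch) - offset for ch in quality)


def quality_filter(records, min_mean_q=20):
    """Keep records whose mean quality is at least min_mean_q (hypothetical cut-off)."""
    return [r for r in records if r[2] and mean_phred(r[2]) >= min_mean_q]


example = [
    "@read1", "ACGT", "+", "IIII",  # Phred 40 at every base
    "@read2", "ACGT", "+", "!!!!",  # Phred 0 at every base
]
kept = quality_filter(parse_fastq(example))
print([name for name, _, _ in kept])  # ['read1']
```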
Genome alignment
Aligns the resolved sequence data against a reference genome to report on variants, and against the methylation controls to assess the quality of genetic and epigenetic calling.
Steps
- Align against your defined reference genome and the spike-in controls using BWA-MEM
- Duplicates are removed using Picard MarkDuplicates
- Epigenetic quantification – methylation status is counted at each CpG site by default, and at CHG and CHH sites if you request them
- Accuracy of genetic and epigenetic calls is measured using the spike-in controls
- Variant calling – germline variants are called using GATK HaplotypeCaller, and you can optionally call somatic variants using Mutect2
- Epigenetic calls are conserved from FASTQ to BAM files using MM tags
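The CpG, CHG and CHH contexts mentioned in the steps above are defined by the two bases that follow a cytosine, where H stands for A, C or T. The sketch below classifies a cytosine on the forward strand of a sequence. The function name is hypothetical, and cytosines on the reverse strand (a G on the forward strand) would need the reverse complement, which is omitted here.

```python
def cytosine_context(seq, pos):
    """Return 'CpG', 'CHG' or 'CHH' for the cytosine at 0-based index pos.

    Returns None if the base is not a C or the context cannot be determined
    (end of sequence or an ambiguous N base).
    """
    seq = seq.upper()
    if seq[pos] != "C":
        return None
    following = seq[pos + 1 : pos + 3]
    if following[:1] == "G":
        return "CpG"
    if len(following) < 2 or "N" in following:
        return None
    return "CHG" if following[1] == "G" else "CHH"


reference = "ACGTCAGCTT"
print(cytosine_context(reference, 1))  # CpG  (C followed by G)
print(cytosine_context(reference, 4))  # CHG  (C, A, G)
print(cytosine_context(reference, 7))  # CHH  (C, T, T)
```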
Outputs
- BAM files including MM tags
- Variant Call Format (VCF) files
- duet Cytosine report for epigenetic quantification
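To show how MM tags carry epigenetic calls in a BAM record, here is a minimal decoder for the delta-encoded format described in the SAM optional-tags specification. Each number is the count of unmodified candidate bases to skip before the next modified one. The sketch assumes a forward-strand read, `+` strand entries and a concrete base such as `C`; the example tag value and the function name are made up for illustration, and in practice a library such as pysam would normally handle this.

```python
def decode_mm(seq, mm_value):
    """Map each modification code to the 0-based read positions it marks.

    seq      -- read sequence as sequenced (forward-strand read assumed)
    mm_value -- value of the MM tag, e.g. "C+m,0,1;C+h,3;"
    """
    calls = {}
    for entry in filter(None, mm_value.split(";")):
        head, *deltas = entry.split(",")
        base, strand = head[0], head[1]
        code = head[2:].rstrip("?.")  # drop the optional '?' or '.' flag
        if strand != "+":
            raise ValueError("only '+' strand entries are handled in this sketch")
        # Read positions of every occurrence of the candidate base.
        candidates = [i for i, b in enumerate(seq) if b == base]
        idx = -1
        positions = []
        for delta in deltas:
            idx += int(delta) + 1  # skip `delta` candidates, take the next one
            positions.append(candidates[idx])
        calls.setdefault(code, []).extend(positions)
    return calls


read = "ACGCCGTCG"  # cytosines at read positions 1, 3, 4 and 7
print(decode_mm(read, "C+m,0,1;C+h,3;"))  # {'m': [1, 4], 'h': [7]}
```

In the SAM specification, `m` denotes 5-methylcytosine and `h` denotes 5-hydroxymethylcytosine.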
Functional analysis
Explore your data further with comprehensive analysis tools. Investigate the correlation between genetic and epigenetic data, compare across genomic regions or contexts that you define, and perform cohort analysis with multiple data sets.
Steps
- Variant-associated methylation (VAM): see how genetic and epigenetic information correlate in your samples.
- Allele-specific methylation (ASM) – resolved reads are separated into alleles at heterozygous SNV sites. Methylation status is quantified at CpGs associated with each allele, and labelled as ASM if methylation levels differ by more than 30% between alleles.
- Perform exploratory analysis:
  - Use genomic windows to summarise metrics in regions of interest, e.g. known regions of open chromatin
  - Use contextual annotations to summarise metrics at genetic regions such as gene bodies or areas correlated with high expression
  - Combine outputs into feature sets and plot summaries of those feature sets
- Perform cohort analysis:
  - Compare across multiple experiments using genomic windows defined by you or taken from an external definition
  - Identify differentially methylated regions (DMRs) and differentially hydroxymethylated regions (DhMRs)
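The ASM rule in the steps above can be sketched as a simple comparison of per-allele methylation fractions. This is an illustration only: the function names and the input shape are hypothetical, and the 30% threshold is interpreted here as an absolute difference of 0.30 in methylation fraction, which is an assumption about how the pipeline applies it.

```python
def methylation_fraction(methylated, total):
    """Fraction of methylated calls, or None when there is no coverage."""
    return methylated / total if total else None


def is_asm(allele_a, allele_b, min_diff=0.30):
    """Label a CpG as ASM when the two alleles differ by more than min_diff.

    allele_a, allele_b -- (methylated_count, total_count) for each allele
    """
    frac_a = methylation_fraction(*allele_a)
    frac_b = methylation_fraction(*allele_b)
    if frac_a is None or frac_b is None:
        return False  # cannot compare without reads on both alleles
    return abs(frac_a - frac_b) > min_diff


print(is_asm((18, 20), (4, 20)))   # True:  0.90 vs 0.20
print(is_asm((10, 20), (8, 20)))   # False: 0.50 vs 0.40
```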
Outputs
- ASM files
- Zarr files to enable efficient processing of multidimensional cohort data
- Comprehensive summary plots
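The genomic-window summaries used in exploratory and cohort analysis can be pictured as pooling per-site counts into regions. The sketch below uses fixed-size windows for simplicity. The window size, input tuple layout and function name are assumptions for the example, and it does not reflect the format of the duet outputs.

```python
from collections import defaultdict


def summarise_windows(sites, window_size=1000):
    """Pool per-site methylation counts into fixed-size genomic windows.

    sites -- iterable of (chrom, position, methylated_count, total_count)
    Returns {(chrom, window_start): methylation_fraction}.
    """
    totals = defaultdict(lambda: [0, 0])
    for chrom, pos, methylated, total in sites:
        start = (pos // window_size) * window_size
        totals[(chrom, start)][0] += methylated
        totals[(chrom, start)][1] += total
    # Windows with no coverage are dropped rather than reported as zero.
    return {key: m / t for key, (m, t) in totals.items() if t}


sites = [
    ("chr1", 100, 8, 10),
    ("chr1", 900, 2, 10),
    ("chr1", 1500, 5, 5),
]
print(summarise_windows(sites))
# {('chr1', 0): 0.5, ('chr1', 1000): 1.0}
```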
The duet pipeline can seamlessly perform read resolution, alignment, variant calling and summarisation of methylation status, and produce reports and QC information. Here is what you can expect in terms of processing times:
- Running in a cloud tenancy, allocating exactly the compute resources required at each stage: ~30 hours
- Running on a single dedicated host with 32 cores and 64 GB RAM: 11-12 days
- Using a Slurm cluster of 4 hosts, each with 32 cores and 64 GB RAM: 3-4 days
These estimates are highly dependent on your environment. The timings were calculated assuming 8 samples at approximately 30X coverage.
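For comparing the two fixed-hardware options, the quoted wall-clock ranges can be converted into reserved core-hours. This is plain arithmetic on the figures above. It measures capacity held for the duration of the run, not actual CPU usage, and the helper name is hypothetical.

```python
HOURS_PER_DAY = 24


def core_hours(hosts, cores_per_host, days):
    """Core-hours reserved by a set of hosts over a number of days."""
    return hosts * cores_per_host * days * HOURS_PER_DAY


# Single dedicated host: 32 cores for 11-12 days
single_host = (core_hours(1, 32, 11), core_hours(1, 32, 12))
# Slurm cluster: 4 hosts x 32 cores for 3-4 days
slurm_cluster = (core_hours(4, 32, 3), core_hours(4, 32, 4))

print(single_host)    # (8448, 9216)
print(slurm_cluster)  # (9216, 12288)
```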