duet multiomics solution demo datasets
This section describes reference sequencing datasets generated with the duet multiomics solution +modC and duet evoC reagents and the accompanying bioinformatics pipeline. The data is freely available and can be used to explore the capabilities of duet +modC and duet evoC, assess benchmarks, reproduce performance figures, run demonstrations, and more. Please follow the instructions below for the dataset you need.
Table of Contents
- duet multiomics solution demo datasets
- Table of Contents
- Data Overview
- duet evoC Dataset – NA12878
- duet evoC Dataset – mouse ES E14 cell line
- duet +modC Dataset – NA12878 (incl. multi-sample output file examples)
- Download instructions
- Install gcloud CLI and authenticate
- duet evoC GIAB (Genome in a Bottle) gDNA sample
- duet evoC ES E14 mouse gDNA sample
- duet +modC GIAB (Genome in a Bottle) gDNA sample
- Reference FASTA files
- Further information and assistance
Data Overview
There are three example datasets provided:
- duet evoC dataset for a GIAB (Genome in a Bottle) NA12878 (HG001) gDNA sample
- duet evoC dataset for a mouse ES-E14 gDNA sample
- duet +modC dataset for a GIAB (Genome in a Bottle) NA12878 (HG001) gDNA sample. In this dataset, the multi-sample outputs (zarr store and QC reports) also include additional samples to demonstrate the properties of multi-sample output files
The available data was generated using duet software version 1.5.0 and is representative of the outputs and performance available from the biomodal duet evoC and biomodal duet +modC kits.
For each dataset, the primary output files are grouped into subdirectories as follows:
| Folder | Description |
|---|---|
| reports | Multi-sample QC summary reports available in a variety of formats (HTML, Excel, csv) |
| sample_outputs/allele_specific_methylation | Allele Specific Methylation (ASM) files with one row per heterozygous allele. Methylation on reads associated with each allele is quantified and a call of allele specific methylation is provided where asymmetry is significant in the Fisher’s Exact test. |
| sample_outputs/bams | Binary Alignment Map (BAM) files representing all sequences aligned to the reference genome. Methylation status is represented using a MM SAM tag. |
| sample_outputs/modc_quantification | Plain-text quantification files reporting modified cytosine quantification (mC, hmC, or modC) for each genomic context: CG, CHG, CHH. |
| sample_outputs/variant_call_files/germline | Variant Call Format (VCF) files containing called germline variants |
| sample_outputs/zarr_store | Multi-sample compressed methylation quantification file format for compatibility with biomodal’s modality XPLR software. |
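To illustrate the MM tag mentioned for the BAM files, the sketch below summarises MM tags from SAM-format text. It follows the MM layout in the SAM optional-fields specification: semicolon-terminated groups, each a modification code followed by comma-separated skip counts, one count per modified base. The record shown is synthetic, its modification codes (C+m, C+h) are illustrative and not confirmed to match those written by the duet pipeline, and `mm_counts` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for each SAM record on stdin, print every MM group's
# modification code and the number of modified-base calls it lists.
mm_counts() {
  awk -F'\t' '{
    # Optional SAM fields start at column 12.
    for (i = 12; i <= NF; i++) {
      if ($i !~ /^MM:Z:/) continue
      # Drop the "MM:Z:" prefix, then split into ";"-terminated groups.
      n = split(substr($i, 6), groups, ";")
      for (g = 1; g <= n; g++) {
        if (groups[g] == "") continue
        # parts[1] is the code (e.g. C+m); the remaining entries are
        # skip counts, one per modified base.
        m = split(groups[g], parts, ",")
        print parts[1] "\t" (m - 1)
      }
    }
  }'
}

# Synthetic single-record example (not taken from the demo data).
printf 'read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTCGCGCC\tIIIIIIIIII\tMM:Z:C+m,1,0;C+h,3;\n' | mm_counts
```

With samtools installed, the same function could be fed real records, for example by piping the output of `samtools view` on a downloaded BAM file into `mm_counts`.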
Additionally, there are examples of the diagnostic files routinely published by the pipeline:
| Folder | Description |
|---|---|
| controls | Diagnostic files related to the spike-in controls provided with the duet kits |
| diagnostics | General diagnostic files related to the pipeline execution and resolved FASTQ files |
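Because each complete output dataset is roughly 180 to 220 GiB, it may be convenient to fetch a single folder from the tables above rather than everything. The sketch below is a minimal example under one assumption that this page does not confirm: that the folders sit directly beneath the versioned dataset path (for example giab/evoC/1.5.0/reports). `subfolder_cmd` is a hypothetical helper name; it only prints the command, so the path can be checked with `gcloud storage ls` before anything is downloaded.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) the gcloud command that would copy one
# output folder of a dataset. ASSUMPTION: folders such as "reports" live
# directly under the dataset path; verify first with, for example:
#   gcloud storage ls gs://biomodal-data/giab/evoC/1.5.0/
subfolder_cmd() {
  local dataset="$1" subfolder="$2"
  printf 'gcloud storage cp --recursive gs://biomodal-data/%s/%s ./\n' \
    "$dataset" "$subfolder"
}

# Example: the multi-sample QC reports of the duet evoC GIAB dataset.
subfolder_cmd giab/evoC/1.5.0 reports
```

Once the listing confirms the path, the printed command can be copied and run as-is.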
duet evoC Dataset – NA12878
For this dataset, one GIAB NA12878 gDNA sample (80 ng input) was sonicated to 250 bp fragments. Libraries were prepared using the duet evoC kit and sequenced at 2×151 cycles on a NovaSeq 6000. Sequencing data was downsampled to 1 billion input reads (sufficient for over 30X coverage) and processed using the biomodal duet software.
The following results are representative of what routine processing with the duet evoC kit can achieve.
| Sample name | Sample Origin | Reads | Mean Coverage |
|---|---|---|---|
| CEG1485-EL01-D1115-005 | NA12878 (HG001) | 1000M | 32.6X |
duet evoC Dataset – mouse ES E14 cell line
For this dataset, one mouse ES-E14 gDNA sample was sonicated to 250 bp fragments. Libraries were prepared using the duet evoC kit and sequenced at 2×151 cycles on a NovaSeq 6000. Sequencing data was downsampled to 1 billion input reads (sufficient for over 30X coverage) and processed using the biomodal duet software.
The following results are representative of what routine processing with the duet evoC kit can achieve.
| Sample name | Sample Origin | Reads | Mean Coverage |
|---|---|---|---|
| CEG1485-EL01-D1115-001 | ES-E14 (mouse) | 1000M | 32.8X |
duet +modC Dataset – NA12878 (incl. multi-sample output file examples)
For this dataset, 7 GIAB (Genome in a Bottle) gDNA samples, including the Ashkenazi Trio, each with 80 ng input and sonicated to 250 bp fragments, were prepared using the duet +modC kit and sequenced at 2×151 cycles on a NovaSeq 6000. Sequencing data was downsampled to 1 billion input read-pairs (sufficient for over 30X coverage) and processed using the biomodal duet software.
The following table shows the samples sequenced.
| Sample name | Sample Origin | Reads | Mean Coverage |
|---|---|---|---|
| CEG1532-EL01-A1200-002 | NA12878 (HG001) | 1000M | 31.3X |
| CEG1532-EL01-A1200-005 | NA24385 (HG002) | 1000M | 31.2X |
| CEG1532-EL01-A1200-008 | NA24149 (HG003) | 1000M | 31.3X |
| CEG1532-EL01-A1200-011 | NA24143 (HG004) | 1000M | 31.3X |
| CEG1532-EL01-A1200-015 | NA24631 (HG005) | 1000M | 31.4X |
| CEG1532-EL01-A1200-017 | NA24694 (HG006) | 1000M | 31.4X |
| CEG1532-EL01-A1200-020 | NA24695 (HG007) | 1000M | 31.0X |
To limit the size of this dataset:
- Multi-sample pipeline outputs, such as QC reports and the methylation quantification zarr store, are provided for all 7 samples. This demonstrates the multi-sample properties and content of these files.
- For single-sample pipeline outputs, such as BAM files and plain-text methylation quantification files, only the outputs for one sample (NA12878) are provided.
Download instructions
Install gcloud CLI and authenticate
- Please create a Google account using your institutional email address by selecting the “Use your existing email address” option during account creation. If you already have a Google account, skip this step.
- Download and install the Google Cloud CLI. Make sure that the gcloud init command has been run to authenticate using your Google account.
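The installation and authentication steps above can be sanity-checked before starting a large download. The following is a minimal sketch rather than an official procedure: `check_gcloud` is a hypothetical helper name, and the listing command in the closing comment assumes that your authenticated account is permitted to list that bucket path.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper: confirm the gcloud CLI is installed and
# show which account is active before attempting any download.
check_gcloud() {
  if ! command -v gcloud >/dev/null 2>&1; then
    echo "gcloud not found on PATH; install the Google Cloud CLI first" >&2
    return 1
  fi
  gcloud --version   # print the installed component versions
  gcloud auth list   # the active account is marked with an asterisk
}

# After installation and "gcloud init", run:
#   check_gcloud && gcloud storage ls gs://biomodal-data/giab/evoC/
# (ASSUMPTION: your account is allowed to list this path)
```

If no active account is shown, running gcloud init again should prompt for login.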
If you are using Windows and cannot use bash, you can still download the folders individually using the gcloud storage commands listed below for the duet +modC and duet evoC datasets.
For each dataset, it is possible to download:
- The complete output dataset
- The input FASTQ files
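The datasets listed below range from about 123 GiB to 216 GiB, so it may be worth confirming that the target disk has enough free space before starting a copy. A minimal sketch, assuming a POSIX `df` is available; `has_space_gib` is a hypothetical helper name and the 196 GiB figure is the approximate size quoted for the complete duet evoC GIAB output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if DIR (default: current directory) has at
# least REQUIRED_GIB gibibytes of free space.
has_space_gib() {
  local required_gib="$1" dir="${2:-.}"
  local avail_kib
  # With -P, df prints one line per filesystem; with -k, column 4 is the
  # available space in KiB.
  avail_kib=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  [ "$avail_kib" -ge $(( required_gib * 1024 * 1024 )) ]
}

# Example: check for the ~196 GiB complete duet evoC GIAB output dataset.
if has_space_gib 196 .; then
  echo "Enough free space for the 196 GiB dataset"
else
  echo "Less than 196 GiB free in the current directory"
fi
```

The sizes in the tables are approximate, so leaving some headroom above the listed figure is a reasonable precaution.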
duet evoC GIAB (Genome in a Bottle) gDNA sample
Complete output dataset
| Folder | Description | Approximate Size | Download command |
|---|---|---|---|
| giab/evoC/1.5.0 | Complete duet evoC GIAB output dataset | 196 GiB | gcloud storage cp --recursive gs://biomodal-data/giab/evoC/1.5.0 ./ |
Input FASTQ files
| Folder | Description | Approximate Size | Download command |
|---|---|---|---|
| giab/evoC/input | duet evoC GIAB input FASTQ files | 125 GiB | gcloud storage cp --recursive gs://biomodal-data/giab/evoC/input ./ |
duet evoC ES E14 mouse gDNA sample
Complete output dataset
| Folder | Description | Approximate Size | Download command |
|---|---|---|---|
| mouse/evoC/1.5.0 | Complete duet evoC mouse output dataset | 184 GiB | gcloud storage cp --recursive gs://biomodal-data/mouse/evoC/1.5.0 ./ |
Input FASTQ files
| Folder | Description | Approximate Size | Download command |
|---|---|---|---|
| mouse/evoC/input | duet evoC mouse input FASTQ files | 123 GiB | gcloud storage cp --recursive gs://biomodal-data/mouse/evoC/input ./ |
duet +modC GIAB (Genome in a Bottle) gDNA sample
Complete output dataset
| Folder | Description | Approximate Size | Download command |
|---|---|---|---|
| giab/+modC/1.5.0 | Complete duet +modC GIAB output dataset | 216 GiB | gcloud storage cp --recursive gs://biomodal-data/giab/+modC/1.5.0 ./ |
Input FASTQ files
| Folder | Description | Approximate Size | Download command |
|---|---|---|---|
| giab/+modC/input | duet +modC GIAB input FASTQ files | 124 GiB | gcloud storage cp --recursive gs://biomodal-data/giab/+modC/input ./ |
Reference FASTA files
| Folder | Description | Approximate Size | Download command |
|---|---|---|---|
| reference_fasta/human | Human reference genome GRCh38 | 753 MiB | gcloud storage cp --recursive gs://biomodal-data/reference_fasta/human ./ |
| reference_fasta/mouse | Mouse reference genome GRCm38.p6 | 703 MiB | gcloud storage cp --recursive gs://biomodal-data/reference_fasta/mouse ./ |
| reference_fasta/controls | duet spike-in controls | 17 KiB | gcloud storage cp --recursive gs://biomodal-data/reference_fasta/controls ./ |
Further information and assistance
For additional information, assistance, or for access to demo data for previous pipeline versions, please contact support@biomodal.com.