demo-data-instructions

duet multiomics solution demo datasets

This section covers reference sequencing data generated using the duet multiomics solution +modC or duet evoC reagents and the bioinformatics pipeline. The data is freely available and can be used to explore the capabilities of duet +modC and duet evoC, assess benchmarks, reproduce performance, run demos, and more. Please follow the instructions below for the desired dataset.

Data Overview

There are three example datasets provided:

  • duet evoC dataset for a GIAB (Genome in a Bottle) NA12878 (HG001) gDNA sample
  • duet evoC dataset for a mouse ES-E14 gDNA sample
  • duet +modC dataset for a GIAB (Genome in a Bottle) NA12878 (HG001) gDNA sample. In this dataset, the multi-sample outputs (zarr store and QC reports) also feature additional samples to demonstrate the properties of multi-sample output files

The available data has been generated using duet software version 1.5.0 and is representative of the outputs and performance available from biomodal duet evoC and biomodal duet +modC kits.

For each dataset, the primary output files are grouped into subdirectories as follows:

| Folder | Description |
| --- | --- |
| reports | Multi-sample QC summary reports available in a variety of formats (HTML, Excel, CSV) |
| sample_outputs/allele_specific_methylation | Allele Specific Methylation (ASM) files with one row per heterozygous allele. Methylation on reads associated with each allele is quantified, and a call of allele-specific methylation is provided where the asymmetry is significant by Fisher’s exact test. |
| sample_outputs/bams | Binary Alignment Map (BAM) files representing all sequences aligned to the reference genome. Methylation status is represented using an MM SAM tag. |
| sample_outputs/modc_quantification | Plain-text quantification files reporting modified cytosine quantification (mC, hmC, or modC) for each genomic context: CG, CHG, CHH. |
| sample_outputs/variant_call_files/germline | Variant Call Format (VCF) files containing called germline variants |
| sample_outputs/zarr_store | Multi-sample compressed methylation quantification file format for compatibility with biomodal’s modality XPLR software. |
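As a quick sanity check on the BAM description above, the following sketch counts how many alignment records carry an MM tag. It reads SAM text on standard input and relies only on the SAM convention that optional tags start at the twelfth tab-separated field. The function name count_mm_tagged and the file name in the usage comment are hypothetical, and the source does not state whether every read in these BAM files carries the tag.

```shell
# Count SAM records that carry an MM (base modification) tag.
# Reads SAM text on stdin, for example the output of `samtools view`.
count_mm_tagged() {
  awk -F '\t' '
    /^@/ { next }                      # skip SAM header lines
    {
      total++
      for (i = 12; i <= NF; i++) {     # optional tags start at field 12
        if ($i ~ /^MM:Z:/) { tagged++; break }
      }
    }
    END { printf "%d of %d reads carry an MM tag\n", tagged, total }
  '
}

# Example (hypothetical file name; requires samtools to be installed):
#   samtools view sample.bam | head -n 100000 | count_mm_tagged
```

Sampling only the first records with head keeps the check fast on a 30X BAM file; drop it to scan the whole file.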

Additionally, there are examples of the diagnostic files routinely published by the pipeline:

| Folder | Description |
| --- | --- |
| controls | Diagnostic files related to the spike-in controls provided with the duet kits |
| diagnostics | General diagnostic files related to the pipeline execution and resolved FASTQ files |
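After downloading a dataset, the folder lists above can be turned into a simple completeness check. The sketch below assumes that the listed folders sit directly under the directory passed as the first argument; the actual nesting inside a downloaded dataset may differ, so adjust the root accordingly. The function name check_duet_layout is hypothetical.

```shell
# Report which of the documented output folders exist under a dataset root.
# Assumption: the folders are direct children of the given root directory.
check_duet_layout() {
  local root="$1"
  local missing=0
  local dir
  for dir in \
    reports \
    sample_outputs/allele_specific_methylation \
    sample_outputs/bams \
    sample_outputs/modc_quantification \
    sample_outputs/variant_call_files/germline \
    sample_outputs/zarr_store \
    controls \
    diagnostics
  do
    if [ -d "$root/$dir" ]; then
      echo "ok       $dir"
    else
      echo "MISSING  $dir"
      missing=1
    fi
  done
  # Non-zero exit status when at least one folder is absent.
  return "$missing"
}

# Example (hypothetical local path):
#   check_duet_layout ./1.5.0
```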

duet evoC Dataset – NA12878

For this dataset, one GIAB NA12878 gDNA sample (80 ng input) was sonicated to 250 bp fragments. Libraries were prepared using the duet evoC kit and sequenced at 2×151 on a NovaSeq 6000. Sequencing data was downsampled to 1 billion input reads (sufficient for over 30X coverage) and processed using the biomodal duet software.

The following results are representative of routine processing that can be performed and achieved with the duet evoC kit.

| Sample name | Sample Origin | Reads | Mean Coverage |
| --- | --- | --- | --- |
| CEG1485-EL01-D1115-005 | NA12878 (HG001) | 1000M | 32.6X |

duet evoC Dataset – mouse ES E14 cell line

For this dataset, gDNA from one mouse ES-E14 cell line was sonicated to 250 bp fragments. Libraries were prepared using the duet evoC kit and sequenced at 2×151 on a NovaSeq 6000. Sequencing data was downsampled to 1 billion input reads (sufficient for over 30X coverage) and processed using the biomodal duet software.

The following results are representative of routine processing that can be performed and achieved with the duet evoC kit.

| Sample name | Sample Origin | Reads | Mean Coverage |
| --- | --- | --- | --- |
| CEG1485-EL01-D1115-001 | ES-E14 (mouse) | 1000M | 32.8X |

duet +modC Dataset – NA12878 (incl. multi-sample output file examples)

For this dataset, 7 GIAB (Genome in a Bottle) gDNA samples, including the Ashkenazi trio, were each sonicated to 250 bp fragments (80 ng input per sample), prepared using the duet +modC kit, and sequenced at 2×151 cycles on a NovaSeq 6000. Sequencing data was downsampled to 1 billion input read pairs (sufficient for over 30X coverage) and processed using the biomodal duet software.

The following table shows the samples sequenced.

| Sample name | Sample Origin | Reads | Mean Coverage |
| --- | --- | --- | --- |
| CEG1532-EL01-A1200-002 | NA12878 (HG001) | 1000M | 31.3X |
| CEG1532-EL01-A1200-005 | NA24385 (HG002) | 1000M | 31.2X |
| CEG1532-EL01-A1200-008 | NA24149 (HG003) | 1000M | 31.3X |
| CEG1532-EL01-A1200-011 | NA24143 (HG004) | 1000M | 31.3X |
| CEG1532-EL01-A1200-015 | NA24631 (HG005) | 1000M | 31.4X |
| CEG1532-EL01-A1200-017 | NA24694 (HG006) | 1000M | 31.4X |
| CEG1532-EL01-A1200-020 | NA24695 (HG007) | 1000M | 31.0X |

To limit the size of this dataset:

  • Multi-sample pipeline outputs, such as the QC reports and the methylation quantification zarr store, are provided with all 7 samples. This demonstrates the multi-sample properties and content of these files.

  • For single-sample pipeline outputs, such as BAM files and plain-text methylation quantification files, only the outputs relating to one sample (NA12878) have been provided.

Download instructions

Install gcloud CLI and authenticate

  1. Please create a Google account using your institutional email address by selecting the “Use your existing email address” option during account creation. If you already have a Google account, skip this step.

  2. Download and install the Google Cloud CLI (gcloud). Make sure that the gcloud init command has been run to authenticate using your Google account.
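Before starting a large download, it can help to confirm that the two steps above are complete. The sketch below checks that gcloud is on the PATH and that an account is active, using gcloud auth list with a filter on active credentials. The function name check_gcloud_ready is hypothetical.

```shell
# Verify that the gcloud CLI is installed and an account is authenticated.
check_gcloud_ready() {
  if ! command -v gcloud >/dev/null 2>&1; then
    echo "gcloud not found on PATH; install the Google Cloud CLI first" >&2
    return 1
  fi
  local account
  # Prints the email of the active account, or nothing if none is active.
  account=$(gcloud auth list --filter=status:ACTIVE --format="value(account)")
  if [ -z "$account" ]; then
    echo "no active gcloud account; run: gcloud init" >&2
    return 1
  fi
  echo "authenticated as $account"
}

# Example:
#   check_gcloud_ready && echo "ready to download"
```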

If you are using Windows and cannot use bash, you can download the folders individually using the gcloud storage commands listed below for the duet +modC and duet evoC datasets.

For each dataset, it is possible to download:

  • The complete output dataset
  • The input FASTQ files

duet evoC GIAB (Genome in a Bottle) gDNA sample

Complete output dataset

| Folder | Description | Approximate Size | Download command |
| --- | --- | --- | --- |
| giab/evoC/1.5.0 | Complete duet evoC GIAB output dataset | 196 GiB | gcloud storage cp --recursive gs://biomodal-data/giab/evoC/1.5.0 ./ |

Input FASTQ files

| Folder | Description | Approximate Size | Download command |
| --- | --- | --- | --- |
| giab/evoC/input | duet evoC GIAB input FASTQ files | 125 GiB | gcloud storage cp --recursive gs://biomodal-data/giab/evoC/input ./ |

duet evoC ES E14 mouse gDNA sample

Complete output dataset

| Folder | Description | Approximate Size | Download command |
| --- | --- | --- | --- |
| mouse/evoC/1.5.0 | Complete duet evoC mouse output dataset | 184 GiB | gcloud storage cp --recursive gs://biomodal-data/mouse/evoC/1.5.0 ./ |

Input FASTQ files

| Folder | Description | Approximate Size | Download command |
| --- | --- | --- | --- |
| mouse/evoC/input | duet evoC mouse input FASTQ files | 123 GiB | gcloud storage cp --recursive gs://biomodal-data/mouse/evoC/input ./ |

duet +modC GIAB (Genome in a Bottle) gDNA sample

Complete output dataset

| Folder | Description | Approximate Size | Download command |
| --- | --- | --- | --- |
| giab/+modC/1.5.0 | Complete duet +modC GIAB output dataset | 216 GiB | gcloud storage cp --recursive gs://biomodal-data/giab/+modC/1.5.0 ./ |

Input FASTQ files

| Folder | Description | Approximate Size | Download command |
| --- | --- | --- | --- |
| giab/+modC/input | duet +modC GIAB input FASTQ files | 124 GiB | gcloud storage cp --recursive gs://biomodal-data/giab/+modC/input ./ |
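The per-dataset commands above follow one pattern, so they can be wrapped in a small helper. This is a minimal sketch: the dataset labels (giab-evoC, mouse-evoC, giab-modC), the function names, and the DRY_RUN variable are hypothetical conveniences, while the bucket paths and the gcloud storage cp --recursive command are taken from the tables above.

```shell
# Map a dataset label and kind to its bucket path.
# $1 = dataset (giab-evoC | mouse-evoC | giab-modC), $2 = kind (output | input)
duet_demo_uri() {
  local base="gs://biomodal-data" path leaf
  case "$1" in
    giab-evoC)  path="giab/evoC" ;;
    mouse-evoC) path="mouse/evoC" ;;
    giab-modC)  path="giab/+modC" ;;
    *) echo "unknown dataset: $1" >&2; return 1 ;;
  esac
  case "$2" in
    output) leaf="1.5.0" ;;
    input)  leaf="input" ;;
    *) echo "unknown kind: $2" >&2; return 1 ;;
  esac
  echo "$base/$path/$leaf"
}

# Download a dataset into a destination directory (default: current directory).
# With DRY_RUN=1 the command is printed instead of executed.
duet_demo_download() {
  local uri dest="${3:-.}"
  uri=$(duet_demo_uri "$1" "$2") || return 1
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "gcloud storage cp --recursive $uri $dest/"
  else
    mkdir -p "$dest" && gcloud storage cp --recursive "$uri" "$dest/"
  fi
}

# Example:
#   DRY_RUN=1 duet_demo_download giab-modC output ./demo-data
```

Printing the command first with DRY_RUN=1 is a cheap way to confirm the source path before committing to a transfer of well over 100 GiB.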

Reference FASTA files

| Folder | Description | Approximate Size | Download command |
| --- | --- | --- | --- |
| reference_fasta/human | Human reference genome GRCh38 | 753 MiB | gcloud storage cp --recursive gs://biomodal-data/reference_fasta/human ./ |
| reference_fasta/mouse | Mouse reference genome GRCm38.p6 | 703 MiB | gcloud storage cp --recursive gs://biomodal-data/reference_fasta/mouse ./ |
| reference_fasta/controls | duet spike-in controls | 17 KiB | gcloud storage cp --recursive gs://biomodal-data/reference_fasta/controls ./ |
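Given the approximate sizes listed in the tables above, it may be worth confirming free disk space before downloading. The sketch below compares the space available in a target directory (as reported by df -Pk) against a required size in GiB. The function name has_free_gib is hypothetical, and the required size is whatever figure you read from the relevant table.

```shell
# Check that a directory has at least the required free space.
# $1 = target directory, $2 = required space in GiB
has_free_gib() {
  local avail_kib need_kib
  # Field 4 of the second line of `df -Pk` is the available space in KiB.
  avail_kib=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
  need_kib=$(( $2 * 1024 * 1024 ))
  if [ "$avail_kib" -ge "$need_kib" ]; then
    echo "enough space: $((avail_kib / 1024 / 1024)) GiB free, $2 GiB needed"
  else
    echo "not enough space: $((avail_kib / 1024 / 1024)) GiB free, $2 GiB needed" >&2
    return 1
  fi
}

# Example: the complete duet +modC GIAB output dataset is listed at about 216 GiB.
#   has_free_gib . 216
```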

Further information and assistance

For additional information, assistance, or for access to demo data for previous pipeline versions, please contact support@biomodal.com.