Call-dmr-pipeline

Static Badge Static Badge Static Badge

Table of Contents

(back to main documentation)

11. D(h)MR calling workflow

  • The D(h)MR calling workflow is used to identify differentially methylated or differentially hydroxymethylated regions between groups of samples analysed using the duet pipeline.
  • The D(h)MR calling workflow uses the Zarr store output from the duet pipeline as an input. It can also take multiple Zarr stores from different executions of the duet pipeline (e.g from processing different sequencing runs) as an input.
  • The D(h)MR calling workflow requires a sample sheet listing the samples to be used for D(h)MR calling, the condition associated with each sample, and information about covariates that should be taken into account.
  • Samples are grouped by condition and DMR calling is performed between all pairwise combinations of conditions.
  • A typical example might be to call DMRs between a set of cancer samples and a set of healthy samples, accounting for sex and age as covariates
  • With duet +modC, DMRs are called based on modC (modC is the union of mC and hmC modifications).
  • With duet evoC, mC is distinguishable from hmC, and DMR calling is performed separately for mC modifications and for hmC modifications. It is also possible to call DMRs based on modC – i.e. on the union of mC and hmC calls.

11.1. Running the D(h)MR calling workflow

biomodal call_dmr                    Run D(h)MR analysis
  --input-path <input_path>              Required: full custom bucket uri or directory path with Zarr stores to be analysed
  --dmr-sample-sheet <sample_sheet_name> Required: full path and name of the DMR sample sheet  
  --output-path <output_path>            Required: custom output disk location or bucket uri
  --tag <tag>                            Required: custom tag to identify this DMR run
  --mode <5bp|6bp>                       Required: 5 base vs 6 base mode, default is '5bp'
  --condition <condition(s)>             Required: a single DMR sample sheet column that contains the conditions between which to call DMRs. (Note that 'group' cannot be used as a column name.)
  --covariates <covariates>              Optional: one or more DMR sample sheet columns that contain covariates to account for during DMR calling. (Note that more than 2 corvariates may increase runtime and memory requirements.)
  --dmr-bed-path <dmr_bed_path>          Optional: a path to a bed file defining regions that DMR calling should be restricted to
  --evoc-modifications <mc|hmc|modc>     Optional: single or comma separated list of duet evoC modifications to call, 'mc' or 'hmc' in 6bp mode only
  --min-depth <min_depth>                Optional: contexts will only be removed if coverage <= min-depth in ALL SAMPLES, default is '0'
  --window-size <window_size>            Optional: window size for DMR analysis, default is '1000' (Minimum = 200)
  --additional-params <params>           Optional: additional single or comma separated parameters, e.g. 'param1=value1,param2=value2'
  --run-name <run_name>                  Optional: custom run name to identify this analysis run
  --resume                               Optional: resume previous run using successfully completed tasks, default is 'false'
  --work-dir                             Optional: override the default path where the workflow temporary data is stored

Small example:

biomodal call_dmr \
  --input-path gs://my-bucket-name/zarr_files/ \
  --output-path gs://my-bucket-name/zarr_files/output \
  --dmr-sample-sheet gs://my-bucket-name/zarr_files/dmr_sample_sheet.tsv \
  --dmr-bed-path gs://my-bucket-name/zarr_files/my_bedfile.bed \
  --mode 5bp \
  --condition disease_state \
  --tag dmr_run1 \
  --run-name dmr_run1

(back to main documentation) | (back to top)

11.2. D(h)MR calling workflow input and output file requirements

11.2.1 D(h)MR calling workflow input file and parameters

Path to directory containing your Zarr file(s)

  • This should be a path to a directory containing the multi-sample Zarr store that got generated from running the duet pipeline, i.e. the sample_outputs/zarr_store/ subdirectory of duet pipeline outputs.
  • This can also be a directory where you have brought together several multi-sample Zarr stores generated from running the duet pipeline (for instance if you have separately processed multiple sequencing runs)
  • The Zarr store is a multi-sample, multi-dimensional, compressed, indexed data store.

Single type of multi-sample Zarr store for each workflow run

Please ensure that the input path includes only Zarr stores of a single cytosine context (CG, CHG, or CHH) before running the call_dmr workflow.
Separate Zarr stores of different cytosine contexts into distinct folders or paths as needed.

DMR sample sheet

  • The DMR sample sheet should be a tsv file with a header, and with one row per sample that you want included in the DMR calling
  • The columns should include a sample_id column, a column featuring the conditions between which you wish to call DMRs, and (optionally) columns for any covariates associated your samples. DMR calling will be performed between each pairwise comparison of conditions.
  • Please note that group cannot be used as a column name as this is a reserved word.
  • Here is an example of a sample sheet that would be suitable for calling DMRs between healthy and cancer samples, accounting for sex and age as covariates:
sample_id   disease_state     sex   age
BIO-A1      healthy            M     62
BIO-A2      healthy            M     58
BIO-A3      healthy            F     66
BIO-A4      cancer_stage1      F     66
BIO-A5      cancer_stage1      M     65
BIO-A6      cancer_stage1      M     55
BIO-A7      cancer_stage4      F     61
BIO-A8      cancer_stage4      F     59
  • Using the above example, the parameters --condition disease_state --covariates sex,age DMR calling would be performed sequentially for “healthy vs cancer_stage1”, “healthy vs cancer_stage4”, and “cancer_stage1 vs cancer_stage4” using “sex” and “age” as covariates.

DMR bed path

  • This should be a path to a bed file containing regions that DMR calling should be performed on. DMR calling will be performed between each condition for each region listed in the bed file, --dmr-bed-path cannot be used in conjuction with --window-size. For instance, you may wish to restrict DMR calling to a set of promoters that you are interested in.

  • BED format is zero-based and half-open, meaning the first position on contig is 0, and the End coordinate is excluded from the interval. Note: some methylation quantification outputs from the duet pipeline (e.g. cxreport) have 1-based positions.

chr1   1000171    1001137   
chr1   1019119    1020119   
chr1   1172879    1173879   
chr1   1232030    1232236   
chr1   1279435    1280435   
chr1   10031831   10032831  
chr1   10209569   10210569  
chr1   10397591   10398591  
  • If you wish to perform Differentially Methylated Cytosine (DMC) analysis, i.e single CpG regions, you will need to pass a BED which for the regions are single CpGs. Note: The inputted DMR BED cannot contain more than 5 million intervals.
chr1   10031845   10031846
chr1   10031846   10031847
chr1   10031915   10031916
chr1   10031916   10031917
chr1   10031917   10031918
chr1   10031918   10031919
chr1   10031925   10031926
chr1   10031926   10031927
chr1   10031947   10031948
chr1   10031948   10031949
chr1   10031949   10031950
chr1   10031950   10031951

Window size

  • If a dmr-bed-path is not supplied, DMRs will be called genome-wide. The genome will be split into window-size bp (default = 1000) regions to perform DMR analysis on. Note: if a DMR BED is supplied, window-size will ignored.

  • Reducing the window size for genome-wide DMR calling increases the runtime and memory requirement. The minimum supported window size is 200bp.

evoC modifications

In 5bp mode (duet +modC) DMRs can only be called on modC. In 6bp mode (duet evoC) DMRs will be called on mC and hmC separately by default. However, it is possible to call DMRs on any combination of mC, hmC or combined modC.

  • To call hmC DMRs only at enhancer regions use --dmr-bed-path full-path/to-your/enhancer-intervals.bed --evoc-modifications hmc
  • To call mC DMRs only at promotor regions use --dmr-bed-path full-path/to-your/promoter-intervals.bed --evoc-modifications mc
  • To call mC, hmC, and combined modC DMRs on 500bp regions genomewide use --window-size 500 --evoc-modifications mc,hmc,modc

Min Depth

If --min-depth is supplied, CG contexts will be removed only if coverage <= min-depth in ALL SAMPLES. Note: this is stranded CG coverage

11.2.2 D(h)MR calling workflow output file formats and locations

D(h)MR output BED

D(h)MR results are output in BED format, with the following fields.

Field Description
Chromosome Intervals over which DMR is called
Start Start position of the interval
End End position of the interval
Name Reserved for future use
Score Reserved for future use
Strand Reserved for future use
num_contexts Number of CpG contexts in the interval
mean_coverage_across_samples Mean CpG coverage across all samples
mean_mod_group_1 Mean modification fraction in group 1
mean_mod_group_2 Mean modification fraction in group 2
mod_fold_change Fold change between group 1 and group 2
mod_difference Difference between mean_mod_group_1 and mean_mod_group_2
test_statistic Difference between the log likelihood of the full model (with group specified) and the null model (without group specified)
dmr_pvalue Unadjusted p-value
dmr_qvalue Adjusted p-value (Benjamini-Hochberg)

If the above DMR samplesheet was analysed in 6bp, we’d expect the following outputs to be generated in the --output-path directory:

DMR_num_mc_healthy_cancer_stage1.bed
DMR_num_mc_healthy_cancer_stage4.bed
DMR_num_mc_cancer_stage1_cancer_stage4.bed
DMR_num_hmc_healthy_cancer_stage1.bed
DMR_num_hmc_healthy_cancer_stage4.bed
DMR_num_hmc_cancer_stage1_cancer_stage4.bed

D(h)MR output summary

  • An additional summary.txt file is generated per output BED file which lists metadata about the DMR including the version of the biomodal software package used, the input files, and the full commands used. It also reports the number of DMRs which pass a q < 0.05 significance threshold.

(back to main documentation) | (back to top)

Cambridge Epigenetix is now biomodal