Table of Contents
11. D(h)MR calling workflow
- The D(h)MR calling workflow is used to identify differentially methylated or differentially hydroxymethylated regions between groups of samples analysed using the duet pipeline.
- The D(h)MR calling workflow uses the Zarr store output from the duet pipeline as an input. It can also take multiple Zarr stores from different executions of the duet pipeline (e.g from processing different sequencing runs) as an input.
- The D(h)MR calling workflow requires a sample sheet listing the samples to be used for D(h)MR calling, the condition associated with each sample, and information about covariates that should be taken into account.
- Samples are grouped by condition and DMR calling is performed between all pairwise combinations of conditions.
- A typical example might be to call DMRs between a set of cancer samples and a set of healthy samples, accounting for sex and age as covariates
- With duet +modC, DMRs are called based on modC (modC is the union of mC and hmC modifications).
- With duet evoC, mC is distinguishable from hmC, and DMR calling is performed separately for mC modifications and for hmC modifications. It is also possible to call DMRs based on modC – i.e. on the union of mC and hmC calls.
11.1. Running the D(h)MR calling workflow
biomodal call_dmr Run D(h)MR analysis
--input-path <input_path> Required: full custom bucket uri or directory path with Zarr stores to be analysed
--dmr-sample-sheet <sample_sheet_name> Required: full path and name of the DMR sample sheet
--output-path <output_path> Required: custom output disk location or bucket uri
--tag <tag> Required: custom tag to identify this DMR run
--mode <5bp|6bp> Required: 5 base vs 6 base mode, default is '5bp'
--condition <condition(s)> Required: a single DMR sample sheet column that contains the conditions between which to call DMRs. (Note that 'group' cannot be used as a column name.)
--covariates <covariates> Optional: one or more DMR sample sheet columns that contain covariates to account for during DMR calling. (Note that more than 2 corvariates may increase runtime and memory requirements.)
--dmr-bed-path <dmr_bed_path> Optional: a path to a bed file defining regions that DMR calling should be restricted to
--evoc-modifications <mc|hmc|modc> Optional: single or comma separated list of duet evoC modifications to call, 'mc' or 'hmc' in 6bp mode only
--min-depth <min_depth> Optional: contexts will only be removed if coverage <= min-depth in ALL SAMPLES, default is '0'
--window-size <window_size> Optional: window size for DMR analysis, default is '1000' (Minimum = 200)
--additional-params <params> Optional: additional single or comma separated parameters, e.g. 'param1=value1,param2=value2'
--run-name <run_name> Optional: custom run name to identify this analysis run
--resume Optional: resume previous run using successfully completed tasks, default is 'false'
--work-dir Optional: override the default path where the workflow temporary data is stored
Small example:
biomodal call_dmr \
--input-path gs://my-bucket-name/zarr_files/ \
--output-path gs://my-bucket-name/zarr_files/output \
--dmr-sample-sheet gs://my-bucket-name/zarr_files/dmr_sample_sheet.tsv \
--dmr-bed-path gs://my-bucket-name/zarr_files/my_bedfile.bed \
--mode 5bp \
--condition disease_state \
--tag dmr_run1 \
--run-name dmr_run1
(back to main documentation) | (back to top)
11.2. D(h)MR calling workflow input and output file requirements
11.2.1 D(h)MR calling workflow input file and parameters
Path to directory containing your Zarr file(s)
- This should be a path to a directory containing the multi-sample Zarr store that got generated from running the duet pipeline, i.e. the
sample_outputs/zarr_store/
subdirectory of duet pipeline outputs. - This can also be a directory where you have brought together several multi-sample Zarr stores generated from running the duet pipeline (for instance if you have separately processed multiple sequencing runs)
- The Zarr store is a multi-sample, multi-dimensional, compressed, indexed data store.
Single type of multi-sample Zarr store for each workflow run
Please ensure that the input path includes only Zarr stores of a single cytosine context (CG, CHG, or CHH) before running the call_dmr workflow.
Separate Zarr stores of different cytosine contexts into distinct folders or paths as needed.
DMR sample sheet
- The DMR sample sheet should be a tsv file with a header, and with one row per sample that you want included in the DMR calling
- The columns should include a
sample_id
column, a column featuring the conditions between which you wish to call DMRs, and (optionally) columns for any covariates associated your samples. DMR calling will be performed between each pairwise comparison of conditions. - Please note that
group
cannot be used as a column name as this is a reserved word. - Here is an example of a sample sheet that would be suitable for calling DMRs between healthy and cancer samples, accounting for sex and age as covariates:
sample_id disease_state sex age
BIO-A1 healthy M 62
BIO-A2 healthy M 58
BIO-A3 healthy F 66
BIO-A4 cancer_stage1 F 66
BIO-A5 cancer_stage1 M 65
BIO-A6 cancer_stage1 M 55
BIO-A7 cancer_stage4 F 61
BIO-A8 cancer_stage4 F 59
- Using the above example, the parameters
--condition disease_state --covariates sex,age
DMR calling would be performed sequentially for “healthy vs cancer_stage1”, “healthy vs cancer_stage4”, and “cancer_stage1 vs cancer_stage4” using “sex” and “age” as covariates.
DMR bed path
-
This should be a path to a bed file containing regions that DMR calling should be performed on. DMR calling will be performed between each condition for each region listed in the bed file,
--dmr-bed-path
cannot be used in conjuction with--window-size
. For instance, you may wish to restrict DMR calling to a set of promoters that you are interested in. -
BED format is zero-based and half-open, meaning the first position on contig is 0, and the End coordinate is excluded from the interval. Note: some methylation quantification outputs from the duet pipeline (e.g. cxreport) have 1-based positions.
chr1 1000171 1001137
chr1 1019119 1020119
chr1 1172879 1173879
chr1 1232030 1232236
chr1 1279435 1280435
chr1 10031831 10032831
chr1 10209569 10210569
chr1 10397591 10398591
- If you wish to perform Differentially Methylated Cytosine (DMC) analysis, i.e single CpG regions, you will need to pass a BED which for the regions are single CpGs. Note: The inputted DMR BED cannot contain more than 5 million intervals.
chr1 10031845 10031846
chr1 10031846 10031847
chr1 10031915 10031916
chr1 10031916 10031917
chr1 10031917 10031918
chr1 10031918 10031919
chr1 10031925 10031926
chr1 10031926 10031927
chr1 10031947 10031948
chr1 10031948 10031949
chr1 10031949 10031950
chr1 10031950 10031951
Window size
-
If a
dmr-bed-path
is not supplied, DMRs will be called genome-wide. The genome will be split intowindow-size
bp (default = 1000) regions to perform DMR analysis on. Note: if a DMR BED is supplied,window-size
will ignored. -
Reducing the window size for genome-wide DMR calling increases the runtime and memory requirement. The minimum supported window size is 200bp.
evoC modifications
In 5bp mode (duet +modC) DMRs can only be called on modC. In 6bp mode (duet evoC) DMRs will be called on mC and hmC separately by default. However, it is possible to call DMRs on any combination of mC, hmC or combined modC.
- To call hmC DMRs only at enhancer regions use
--dmr-bed-path full-path/to-your/enhancer-intervals.bed --evoc-modifications hmc
- To call mC DMRs only at promotor regions use
--dmr-bed-path full-path/to-your/promoter-intervals.bed --evoc-modifications mc
- To call mC, hmC, and combined modC DMRs on 500bp regions genomewide use
--window-size 500 --evoc-modifications mc,hmc,modc
Min Depth
If --min-depth
is supplied, CG contexts will be removed only if coverage <= min-depth
in ALL SAMPLES. Note: this is stranded CG coverage
11.2.2 D(h)MR calling workflow output file formats and locations
D(h)MR output BED
D(h)MR results are output in BED format, with the following fields.
Field | Description |
---|---|
Chromosome |
Intervals over which DMR is called |
Start |
Start position of the interval |
End |
End position of the interval |
Name |
Reserved for future use |
Score |
Reserved for future use |
Strand |
Reserved for future use |
num_contexts |
Number of CpG contexts in the interval |
mean_coverage_across_samples |
Mean CpG coverage across all samples |
mean_mod_group_1 |
Mean modification fraction in group 1 |
mean_mod_group_2 |
Mean modification fraction in group 2 |
mod_fold_change |
Fold change between group 1 and group 2 |
mod_difference |
Difference between mean_mod_group_1 and mean_mod_group_2 |
test_statistic |
Difference between the log likelihood of the full model (with group specified) and the null model (without group specified) |
dmr_pvalue |
Unadjusted p-value |
dmr_qvalue |
Adjusted p-value (Benjamini-Hochberg) |
If the above DMR samplesheet was analysed in 6bp, we’d expect the following outputs to be generated in the --output-path
directory:
DMR_num_mc_healthy_cancer_stage1.bed
DMR_num_mc_healthy_cancer_stage4.bed
DMR_num_mc_cancer_stage1_cancer_stage4.bed
DMR_num_hmc_healthy_cancer_stage1.bed
DMR_num_hmc_healthy_cancer_stage4.bed
DMR_num_hmc_cancer_stage1_cancer_stage4.bed
D(h)MR output summary
- An additional
summary.txt
file is generated per output BED file which lists metadata about the DMR including the version of the biomodal software package used, the input files, and the full commands used. It also reports the number of DMRs which pass aq < 0.05
significance threshold.