Analysing methylation data is challenging: many existing analysis tools are difficult to work with and do not scale well as the number of samples increases. As a result, standard analyses, such as identifying differentially methylated regions (DMRs) or summarising methylation fractions over genomic regions, require substantial time and memory, typically necessitating large-scale compute infrastructure (e.g. compute clusters or the cloud). A recently introduced technology, duet multiomics solution evoC, enables the reading of 5-base and 6-base information from DNA, further increasing the scale and complexity of the data that can be extracted from a single sequencing experiment. With this expanded multiomic dataset, the lack of scalability is compounded, hindering the kind of interactive data analysis that provides rapid and detailed insight. To address this, we present a fast, scalable, array-based Python package for the analysis of 5- and 6-base genomes (genetics, 5-mC and 5-hmC). It builds on `dask`, with `zarr` as the storage backend, allowing efficient computation even on datasets that are too large to fit into memory.
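As a rough illustration of the chunked, array-based approach described above, the following sketch summarises methylation fractions over a region using `dask.array`. It is not the package's own API: the function `region_methylation_fraction`, the sites-by-samples layout, and the simulated counts are assumptions made for illustration. In practice the arrays could be opened lazily from a `zarr` store (for example with `dask.array.from_zarr`) rather than built in memory.

```python
import numpy as np
import dask.array as da


def region_methylation_fraction(mc, cov, start, end):
    """Per-sample methylation fraction over site indices [start, end).

    mc and cov are (n_sites, n_samples) dask arrays holding methylated
    counts and total coverage. The layout is an assumption of this sketch.
    """
    mc_sum = mc[start:end].sum(axis=0)    # lazy reduction over sites
    cov_sum = cov[start:end].sum(axis=0)  # lazy reduction over sites
    return (mc_sum / cov_sum).compute()   # work happens here, chunk by chunk


# Simulated counts standing in for real data (hypothetical values).
rng = np.random.default_rng(0)
n_sites, n_samples = 10_000, 8
cov_np = rng.integers(1, 30, size=(n_sites, n_samples))  # coverage >= 1
mc_np = rng.binomial(cov_np, 0.7)                        # methylated reads

# Split the site axis into chunks so each task touches a small block.
cov = da.from_array(cov_np, chunks=(2_000, n_samples))
mc = da.from_array(mc_np, chunks=(2_000, n_samples))

fractions = region_methylation_fraction(mc, cov, 1_000, 5_000)
print(fractions)
```

Because the reductions are evaluated chunk by chunk, only a few chunks need to be held in memory at a time, which is what lets this style of computation extend to arrays larger than RAM.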
We demonstrate how our approach unlocks local analysis of large data cohorts, scaling to thousands of samples, with the major limiting factor being storage space rather than compute. By contrast, existing analysis packages scale poorly and exceed the memory capacity of a typical laptop after ~10 samples. Beyond pure performance benefits, we provide an optimised toolkit of functions for exploratory analysis (e.g. plotting, summary statistics, and summarisation of methylation information across arbitrary genomic regions) as well as downstream analysis (e.g. identifying DMRs and principal component analysis). The package is designed to provide a computationally efficient and intuitive interface for detailed analyses of methylation data, allowing users to move quickly from raw data to actionable insights and publication-ready analyses and figures. It is suited both to interactive analysis and to incorporation into data pipelines for optimised processing at any scale. Moving forward into the era of multiomics, the underlying data structure we use readily accommodates additional data modalities, facilitating efficient multiomic analysis.
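To make the downstream analyses mentioned above more concrete, here is a minimal sketch of principal component analysis on a samples-by-sites matrix of methylation fractions, computed via a singular value decomposition in NumPy. The function `pca_scores`, the two simulated sample groups, and all numeric values are hypothetical; the sketch does not represent how the package itself implements PCA.

```python
import numpy as np


def pca_scores(frac, n_components=2):
    """PCA via SVD on a (n_samples, n_sites) matrix of methylation fractions.

    Returns the sample scores on the leading components and the fraction
    of variance each of those components explains.
    """
    centred = frac - frac.mean(axis=0, keepdims=True)  # centre each site
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]


# Simulated data (hypothetical): 6 samples, 500 sites, two groups of 3.
rng = np.random.default_rng(1)
n_samples, n_sites = 6, 500
base = rng.uniform(0.2, 0.8, size=n_sites)
frac = base + rng.normal(0.0, 0.02, size=(n_samples, n_sites))
frac[3:, :100] += 0.15           # second group is more methylated at 100 sites
frac = np.clip(frac, 0.0, 1.0)   # keep fractions within [0, 1]

scores, explained = pca_scores(frac)
print(scores[:, 0])   # first component scores, one per sample
print(explained)      # variance explained by the first two components
```

In this simulated setting the first component is expected to separate the two groups, since the group shift is much larger than the per-site noise.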