Reference-genome

10. Reference genome pipeline

(back to main documentation)

You can download the reference genome pipeline from the biomodal CLI. The reference genome pipeline is used to create new reference genomes for the duet pipeline. The reference genome pipeline is a separate pipeline from the duet pipeline and is not required to run the duet pipeline using the default reference genomes provide by biomodal.

10.1 Checking available reference pipeline versions

The reference pipeline is not downloaded during a new biomodal software install by default. To download the files necessary to run the reference pipeline, compatible version(s) of the reference pipeline must be downloaded. To view the list of reference pipeline versions available/compatible that can be downloaded in the next step, run the following command:

biomodal reference pipeline list

(back to main documentation) | (back to top)

10.2 Reference pipeline download

This step will download and set up the necessary files and scripts to run the reference pipeline.

To download a new version of the reference genome pipeline, run the following command:

biomodal reference pipeline download <version>

The version number should be the version you want to download. You can find the available versions by running the biomodal reference pipeline list command.

Please note that if you have already downloaded a version of the reference genome pipeline, downloading an update of the duet pipeline will download updates the reference pipeline should this be required.

Downloading this pipeline requres a new parameter entry ref_pipeline_version in the config.yaml file. This will be automatically added when you download the reference pipeline. Please note that the reference pipeline version is not the same as the duet pipeline version.

(back to main documentation) | (back to top)

10.3 Make new reference genomes

10.3.1 Running the reference genome pipeline

To make new reference genomes using the reference genome pipeline, run the following command:

  biomodal reference make                         
    --input-path <your-input-path>
    --output-path <your-output-path>               
    --species <reference-genome-species>                    
    --reference-genome <reference-genome-official-name>

parameter	Description
`--input-path`	Path for the reference genome gzipped FASTA file
`--output-path`	Custom output disk location or bucket url
`--species`	Required: reference species, e.g. ‘Homo_sapiens’, ‘Mus_musculus’
`--reference-genome`	Genome Reference Consortium official name, e.g. ‘GRCh38Decoy’, ‘hs38DH’

Please note that all parameters are required to run this pipeline.

The reference pipeline will not write directly to the central duet pipeline reference location, but will write to a staging location specified in the --output-path parameter. This enables manual verification of the output before moving it to the reference location. And it prevents the pipeline from needing write access to the reference location.

10.3.2. Reference genome input and output files requirements

10.3.2.1 Reference genome input file requirements

To run the pipeline on a new species, it is necessary to generate new reference files by running the reference file pipeline. This takes a gzipped FASTA file for the reference genome.

With the exception of the new FASTA reference genome file, all of the files above are supplied with the duet pipeline and are assumed to have been downloaded and unpackaged according to the instructions supplied. These files will be automatically detected when the Alternative Reference Genome Pipeline is run.

The reference genome for the species required should be obtained from an appropriate source, e.g. the Genome Reference Consortium or by searching the Ensembl or EnsemblPlants genomes database.

When you have downloaded the gzipped FASTA file, please provide the full location and filename in the --input-path parameter. The pipeline will automatically detect the file and use it to generate the reference genome.

Some further notes about preparing the FASTA reference genome:

Some sources supply reference genomes in FASTA format but not gzipped, for example inside a zip bundle with some metadata files. In these cases, you will need to gzip your FASTA file before you can use it as an input to the reference genome pipeline by running gzip /path/to/my/reference.fasta
The contig names inside the FASTA file (the lines that start with the > character) are important since they are used inside the duet pipeline. Sometimes these contigs have long names; so that the duet pipeline can run effectively, you will need to either rename these contigs to match those you add to the genome config file or supply a mapping to map the long name to the names you supply in the config. For example, your contig name might be >NC_044371.1 Cannabis sativa chromosome 1, cs10, whole genome shotgun sequence and you might want to rename it to >chr1 so that you can just add chr1 to the list of contigs in the genome config file.
Make sure the contigs in the reference FASTA contain what you expect:
- Not all reference downloads contain the mitochondrial chromosome, which might be important for your analysis.
- There may be other contigs that are needed for your reference that are downloaded separately, such as the chloroplasts for a plant genome.
- Many references contain unmapped scaffold contigs; these may be important but should probably be excluded from the list of contigs you supply in the genome config file.
- You can inspect the contigs by running grep ">" /path/to/my/reference.fasta.

(back to main documentation) | (back to top)

10.3.2.2 Reference genome output file locations

The output files will be stored in the location specified by the --output-path parameter above. Here you will find the following top level sub-directories:

Location	Description
`reference_genomes/<reference-genome>/`	Reference genome output
`ss_ctrls/<ctrls>/`	Controls output

In the reference genome output location, the files are:

Location	Description
`<reference-genome>-ss-ctrls-<ctrls>.fa.gz`	Combined reference genome and controls FASTA
`<reference-genome>-ss-ctrls-<ctrls>.fa.fai`	Combined reference genome and controls FASTA index
`<reference-genome>.bed`	BED file containing reference genome contig definitions
`<reference-genome>.chrom.sizes`	List of reference genome contig sizes
`<reference-genome>.chrom.txt`	List of reference genome contig names
`<reference-genome>.dict`	Very similar to the BED file, list of all reference genome contigs in dictionary format required by GATK
`<reference-genome>-ss-ctrls-<ctrls>.dict`	List of both reference genome and controls contigs in dictionary format
`<reference-genome>_primary_assembly.bed`	Primary assembly BED for reference genome. This will be exactly the same as the reference genome bed, except for human, mouse or zebrafish

After the pipeline has run successfully, the reference genome files, and a set of existing couplet q-tables files and controls, should be copied to the shared reference genomes folder where the existing genomes are held. This is the same location as the default reference genomes provided when you installed the duet pipeline. (<default bucket/folder location of your pipeline data>/reference_files/).
Example using GCP storage bucket:

# Set the relevant reference genome version
REF_VERSION=1.0.1

# Copy the reference genome files to the shared reference genomes folder
gcloud storage cp --recursive \
 "gs://bucket-name/reference-output-folder/reference_genomes/GRCz11/*" \
 gs://bucket-name/reference_files/${REF_VERSION}_GRCz11/duet/duet-ref-${REF_VERSION}/reference_genomes/GRCz11/

# Copy existing couplet qtables to the new reference genome folder
gcloud storage cp --recursive \
 gs://bucket-name/reference_files/${REF_VERSION}_GRCh38Decoy/duet/couplet \
 gs://bucket-name/reference_files/${REF_VERSION}_GRCz11/duet/

# Copy existing controls to the new reference genome folder
gcloud storage cp --recursive \
 gs://bucket-name/reference_files/${REF_VERSION}_GRCh38Decoy/duet/duet-ref-${REF_VERSION}/ss_ctrls \
 gs://bucket-name/reference_files/${REF_VERSION}_GRCz11/duet/duet-ref-${REF_VERSION}/

The controls output files are:

Location	Description
`ss-ctrls-long-v24.bed`	BED file containing long control contig definitions
`ss-ctrls-long-v24.chrom.txt`	BED file containing long control contig names
`ss-ctrls-long-v24.chrom.sizes`	BED file containing long control contig sizes
`ss-ctrls-short-v24.bed`	BED file containing short control contig definitions
`ss-ctrls-short-v24.chrom.sizes`	BED file containing short control contig sizes
`ss-ctrls-short-v24.chrom.txt`	BED file containing short control contig names

Note that the controls output files will be exactly the same as those already provided with the duet pipeline.

The original input files are also copied to the reference genome/controls subdirectory, as applicable.

(back to main documentation) | (back to top)

10.3.3. Reference genome config file

Once the pipeline has been run successfully and outputs checked, the reference genome directory (i.e. <output-path>/references_genomes/GRCz11)
should be copied to the reference_genomes folder where the existing genomes are held.

The CLI will generate an config file template. The alternative reference can be used by providing a set of specified fields to the pipeline in the correct format, for example:

params {
  genomes {     
    '<<ref_genome>>' {
        contigs          =    ''
        mapping          =    false
        autosomes        =    ''
        mitochondria     =    ''
        primary_assembly =    true
        primary_assembly_bed = "<<ref_genome_path>>/duet/duet-ref-<<ref_pipeline_version>>/reference_genomes/<<ref_genome>>/<<ref_genome>>_primary_assembly.bed"
        target_intervals =    false
        species          =    '<<species>>'
    }
  }
}

This reference genome template file should be edited to include the correct sections for the new reference genome. For any new reference genome, you will need to define these parameters as follows:

Field	Description, Usage and Optionality
`contigs`	A list of chromosomes. This is used: (1) during methylation quantification as a specification of the primary chromosomes over which summary methylation quantification metrics should be calculated: mean CpG methylation rates will be calculated for each chromosome in this list; (2) during variant calling as a specification of the ‘large’ primary assembly contigs. This information is used to support parallelisation of variant calling processes: variant calling will be performed separately in parallel for chromosomes in this list. Variant calling on any other contigs (such as alt contigs or decoy sequences) will be performed in one further parallel step.
`mapping`	A regular expression that can be specified to combine methylation quantification summary statistics from contigs that have a common prefix. If the regular expression matches a contig name, then quantification statistics from that contig will be attributed to the first matching group returned when the regular expression is applied to the contig name. For example, the regular expression `"([^_]+)_.+"` (i.e. multiple characters up to, but not including, the first underscore) when applied to the contig `chr1_KI270706v1_random` would return `chr1` as the first matching group, so would cause quantification data associated with `chr1_KI270706v1_random` to be attributed to `chr1`.
`autosomes`	A list of the autosomes. This is used during methylation quantification to specify the chromosomes that should be used when calculating genome-wide methylation levels. This excludes the allosomes as they would cause a sex-bias in the calculation of genome-wide methylation levels.
`mitochondria`	The name of the mitochondrial chromosome in the reference genome. This is used during methylation quantification to identify the contig over which the mitochondrial methylation rate will be calculated.
`primary_assembly`	A Boolean field indicating whether there is a separate subset of contigs that are considered to be the ‘primary assembly’. This should be set to “false” as a distinct primary assembly is not currently generated for alternative reference genomes.
`species`	Binomial name for the species.
`primary_assembly_bed`	The path to a primary assembly bed file if relevant. Usage is described above.
`<<ref_genome>>`	Reference Genome Consortium name supplied when launching the reference pipeline
`<<species>>`	Species supplied when launching the reference pipeline
`<<ref_genome_path>>`	Automatically applied when launching the reference pipeline. Location of the final reference genome files. This is the shared location expected by the duet pipeline
`<<ref_pipeline_version>>`	Automatically applied when launching the reference pipeline, additionally stored in the CLI `config.yaml` file. This is set during the `biomodal reference pipeline download` stage

The filename will be in the same data-bucket location as for other CLI results and will have the name of the chosen reference genome. Example is Ptrichocarpa_v4-1.config if you launched the pipeline with the reference genome Ptrichocarpa_v4-1.

(back to main documentation) | (back to top)

10.4. Module hardware requirements

Please see the complete overview of module requirements for the reference pipeline.

The module MAKE_BWA_MEM2_INDEX may require a large amount of memory but can be run on a single core. Looking at the BWA_MEM2 documentation, the recommendation is to allocate about 28 x the genome size. For example, the human genome is around 3 gigabases, and therefore requires 3 x 28 = ~84GB RAM. Note that this figure needs to include any additional decoy contigs that are part of the reference fasta you are using.

You may have to add a resource section in the nextflow.config file in the biomodal script folder to specify the hardware requirements for the MAKE_BWA_MEM2_INDEX module in the reference pipeline.

For example:

process { 
  withName: 'MAKE_BWA_MEM2_INDEX' { cpus = 1
                                    memory = "320GB"}
}

(back to main documentation) | (back to top)

10.5. Using new reference genomes in the duet pipeline

To use this new reference genome with the biomodal duet +modC evoC multiomics solution bioinformatics pipeline, please include the following parameter to the biomodal analyse command:

--reference-genome <reference-genome>
--reference-genome-profile <reference-genome-profile-file-path>

The --reference-genome parameter should be set to the name of the reference genome you have created. The --reference-genome-profile parameter should be set to the path of the reference genome profile you have created. The default location of this file is in the output directory you specified for the biomodal make reference command.

(back to main documentation) | (back to top) | (Next)

Products

Discover

Applications and research areas

Explore

Resources

Learn

About biomodal

Connect

Reference-genome

Table of Contents

10.1 Checking available reference pipeline versions

10.2 Reference pipeline download

10.3 Make new reference genomes

10.3.1 Running the reference genome pipeline

10.3.2. Reference genome input and output files requirements

10.3.2.1 Reference genome input file requirements

10.3.2.2 Reference genome output file locations

10.3.3. Reference genome config file

10.4. Module hardware requirements

10.5. Using new reference genomes in the duet pipeline

Discover

Explore

Connect

Legal

Cambridge Epigenetix is now biomodal