Chapter 2 Installation and Setting Reference genome

2.1 Learning Objectives

This chapter will cover:

  • Installation
    • using setuptools
    • using pip
  • Steps for setting reference genome
    • Create GC and mask file for new reference genome

2.2 Libraries

CNVpytor is written in python and it works on both python 2 and 3. Please install python before proceeding with the installation steps.

The following code can be used to check python version

python --version

2.3 Installation

2.3.1 Install by cloning from GitHub

The following lines of codes can be used to install directly from GitHub

> git clone https://github.com/abyzovlab/CNVpytor.git
> cd CNVpytor
> pip install .

For single user (without admin privileges), can use:

> pip install --user .

2.3.2 Install using pip

The following code will download the latest code from GitHub and use pip to install.

pip install git+https://github.com/abyzovlab/CNVpytor.git

2.4 Steps for setting reference genome

Commonly used human reference genomes (HG19 and HG38) are integrated and comes with default GitHub installation. Its detects the genome by comparing the chromosome lengths. Although, other reference genomes are also frequently used in practice. This section will guide one to add a new reference genome.

Now, we will create example configuration file for mouse reference genome MGSCv37 which contains a list of chromosomes and chromosome lengths:

# Filename: example_ref_genome_conf.py

import_reference_genomes = {
    "mm9": {
        "name": "MGSCv37",
        "species": "Mus musculus",
        "chromosomes": OrderedDict(
            [("chr1", (197195432, "A")), ("chr2", (181748087, "A")), ("chr3", (159599783, "A")),
            ("chr4", (155630120, "A")), ("chr5", (152537259, "A")), ("chr6", (149517037, "A")),
            ("chr7", (152524553, "A")), ("chr8", (131738871, "A")), ("chr9", (124076172, "A")),
            ("chr10", (129993255, "A")), ("chr11", (121843856, "A")), ("chr12", (121257530, "A")),
            ("chr13", (120284312, "A")), ("chr14", (125194864, "A")), ("chr15", (103494974, "A")),
            ("chr16", (98319150, "A")), ("chr17", (95272651, "A")), ("chr18", (90772031, "A")),
            ("chr19", (61342430, "A")), ("chrX", (166650296, "S")), ("chrY", (15902555, "S")),
            ("chrM", (16299, "M"))]),
        "gc_file": "/..PATH../MGSCv37_gc_file.pytor",
        "mask_file": "/..PATH../MGSCv37_mask_file.pytor"
    }
}

Letter next to chromosome length denote type of a chromosome: A - autosome, S - sex chromosome, M - mitochondria. The instruction for creating MGSCv37_gc_file.pytor and MGSCv37_mask_file.pytor is in the next section.

To use CNVpytor with new reference genome us -conf option in each cnvpytor command, e.g.

cnvpytor -conf REL_PATH/example_ref_genome_conf.py -root file.pytor -rd file.bam

CNVpytor will use chromosome lengths from alignment file to detect reference genome. However, if you configured reference genome after you had already run -rd step you could assign reference genome using -rg:

cnvpytor -conf REL_PATH/example_ref_genome_conf.py -root file.pytor -rg mm9

To avoid typing -conf REL_PATH/example_ref_genome_conf.py each time you run cnvpytor, you can create an bash alias or make configuration permanent by copying example_ref_genome_conf.py to ~/.cnvpytor/reference_genomes_conf.py.

2.4.1 Create GC and mask file for new reference genome

CNVpytor also has optional features for GC correction and masking (i.e., commonly known false positive regions). One can setup their reference genome by adding its related content in the gc_file and mask_file field of the configuration file.

To create GC file, we need sequence of the reference genome in fasta.gz file:

> cnvpytor -root MGSCv37_gc_file.pytor -gc ~/hg19/mouse.fasta.gz -make_gc_file

This command will produce MGSCv37_gc_file.pytor file that contains information about GC content in 100-base-pair bins. For reference genomes where we have strict mask in the same format as 1000 Genomes Project strict mask, we can create mask file using command:

> cnvpytor -root MGSCv37_mask_file.pytor -mask ~/hg19/mouse.strict_mask.whole_genome.fasta.gz -make_mask_file

If you do not have mask file, You can skip this step. Mask file contains information about regions of the genome that are more accessible to next generation sequencing methods using short reads. CNVpytor uses P marked positions to filter SNP-s and read depth signal. If reference genome configuration does not contain mask file, CNVpytor will still be fully functional, apart from the filtering step. You may also generate your own mask file by creating fasta file that contains character ā€œPā€ if corresponding base pair passes the filter and any character different than ā€œPā€ if not.