
Chapter 2 Installation and Setting the Reference Genome
2.1 Learning Objectives
This chapter will cover:
- Installation
  - using setuptools
  - using pip
- Steps for setting reference genome
- Create GC and mask file for new reference genome
2.2 Libraries
CNVpytor is written in Python and works with both Python 2 and Python 3. Please install Python before proceeding with the installation steps.
The following command checks the installed Python version:
python --version
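The same check can be made from inside Python, which is useful in scripts. The sketch below is only an illustration; the helper name python_version_string is hypothetical and not part of CNVpytor.

```python
import sys


def python_version_string(info=None):
    """Return 'major.minor.micro' for a version tuple (default: the running interpreter)."""
    if info is None:
        info = sys.version_info
    return "{}.{}.{}".format(info[0], info[1], info[2])


if __name__ == "__main__":
    # Prints the version of the interpreter that runs this script.
    print(python_version_string())
```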
2.3 Installation
2.3.1 Install by cloning from GitHub
The following commands install CNVpytor directly from GitHub:
> git clone https://github.com/abyzovlab/CNVpytor.git
> cd CNVpytor
> pip install .
For a single-user installation (without admin privileges), use:
> pip install --user .
2.3.2 Install using pip
The following command downloads the latest code from GitHub and installs it with pip:
pip install git+https://github.com/abyzovlab/CNVpytor.git
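After either installation route, one way to confirm that the package is visible to the interpreter is to look it up on the import path. This is a minimal sketch for Python 3; it assumes the importable package name is cnvpytor, and the helper is_installed is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located on the current import path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "cnvpytor" is assumed to be the importable package name.
    print("cnvpytor installed:", is_installed("cnvpytor"))
```

If this reports False right after installation, pip probably installed into a different Python environment than the one running the check.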
2.4 Steps for setting reference genome
The commonly used human reference genomes (HG19 and HG38) are integrated and come with the default GitHub installation. CNVpytor detects the genome by comparing chromosome lengths. However, other reference genomes are also frequently used in practice. This section explains how to add a new reference genome.
As an example, we will create a configuration file for the mouse reference genome MGSCv37, which contains a list of chromosomes and their lengths:
# Filename: example_ref_genome_conf.py
# OrderedDict is imported explicitly so that this file is self-contained.
from collections import OrderedDict

import_reference_genomes = {
    "mm9": {
        "name": "MGSCv37",
        "species": "Mus musculus",
        # Each entry: (chromosome name, (length in base pairs, chromosome type))
        "chromosomes": OrderedDict(
            [("chr1", (197195432, "A")), ("chr2", (181748087, "A")), ("chr3", (159599783, "A")),
             ("chr4", (155630120, "A")), ("chr5", (152537259, "A")), ("chr6", (149517037, "A")),
             ("chr7", (152524553, "A")), ("chr8", (131738871, "A")), ("chr9", (124076172, "A")),
             ("chr10", (129993255, "A")), ("chr11", (121843856, "A")), ("chr12", (121257530, "A")),
             ("chr13", (120284312, "A")), ("chr14", (125194864, "A")), ("chr15", (103494974, "A")),
             ("chr16", (98319150, "A")), ("chr17", (95272651, "A")), ("chr18", (90772031, "A")),
             ("chr19", (61342430, "A")), ("chrX", (166650296, "S")), ("chrY", (15902555, "S")),
             ("chrM", (16299, "M"))]),
        "gc_file": "/..PATH../MGSCv37_gc_file.pytor",
        "mask_file": "/..PATH../MGSCv37_mask_file.pytor"
    }
}
The letter next to each chromosome length denotes the type of chromosome: A - autosome, S - sex chromosome, M - mitochondria. The instructions for creating MGSCv37_gc_file.pytor and MGSCv37_mask_file.pytor are in the next section.
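Because a typo in a length or type letter is easy to miss in a long list, it can help to sanity-check the chromosome table before using it. The sketch below is an assumption-laden illustration: check_chromosomes and VALID_TYPES are hypothetical names, and the check is not something CNVpytor itself is known to perform.

```python
from collections import Counter, OrderedDict

# Type letters described in the text: autosome, sex chromosome, mitochondria.
VALID_TYPES = {"A", "S", "M"}


def check_chromosomes(chromosomes):
    """Validate (length, type) entries and return the number of chromosomes per type."""
    counts = Counter()
    for name, (length, chrom_type) in chromosomes.items():
        if not isinstance(length, int) or length <= 0:
            raise ValueError("%s: length must be a positive integer" % name)
        if chrom_type not in VALID_TYPES:
            raise ValueError("%s: unknown chromosome type %r" % (name, chrom_type))
        counts[chrom_type] += 1
    return dict(counts)


# A shortened excerpt of the MGSCv37 table, used only as sample input.
example = OrderedDict(
    [("chr19", (61342430, "A")), ("chrX", (166650296, "S")),
     ("chrY", (15902555, "S")), ("chrM", (16299, "M"))]
)
print(check_chromosomes(example))
```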
To use CNVpytor with the new reference genome, use the -conf option in each cnvpytor command, e.g.:
cnvpytor -conf REL_PATH/example_ref_genome_conf.py -root file.pytor -rd file.bam
CNVpytor will use the chromosome lengths from the alignment file to detect the reference genome. However, if you configured the reference genome after you had already run the -rd step, you can assign the reference genome using -rg:
cnvpytor -conf REL_PATH/example_ref_genome_conf.py -root file.pytor -rg mm9
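The length-based detection mentioned above can be pictured with a small sketch. This is not CNVpytor's actual implementation; detect_genome is a hypothetical function, and the matching rule (all shared chromosomes must have identical lengths) is an assumption made for illustration.

```python
from collections import OrderedDict


def detect_genome(observed_lengths, genomes):
    """Return the key of the first genome whose chromosome lengths agree with the observed ones.

    observed_lengths: dict mapping chromosome name -> length (e.g. from an alignment header).
    genomes: dict shaped like import_reference_genomes.
    """
    for key, genome in genomes.items():
        known = genome["chromosomes"]
        shared = [c for c in observed_lengths if c in known]
        # Require at least one shared chromosome and identical lengths for all shared ones.
        if shared and all(known[c][0] == observed_lengths[c] for c in shared):
            return key
    return None


# Shortened excerpt of the mm9 entry, used only as sample input.
genomes = {
    "mm9": {
        "chromosomes": OrderedDict(
            [("chr19", (61342430, "A")), ("chrM", (16299, "M"))]
        )
    }
}
print(detect_genome({"chr19": 61342430, "chrM": 16299}, genomes))  # mm9
print(detect_genome({"chr19": 12345}, genomes))  # None
```

When no configured genome matches, as in the second call, assigning the genome explicitly with -rg is the fallback described in the text.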
To avoid typing -conf REL_PATH/example_ref_genome_conf.py each time you run cnvpytor, you can create a bash alias or make the configuration permanent by copying example_ref_genome_conf.py to ~/.cnvpytor/reference_genomes_conf.py.
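The copy step can also be scripted. The following is a minimal sketch; install_config is a hypothetical helper, and the home parameter exists only so the function can be tried against a scratch directory instead of the real home directory.

```python
import os
import shutil


def install_config(src, home=None):
    """Copy `src` to <home>/.cnvpytor/reference_genomes_conf.py and return the destination path."""
    if home is None:
        home = os.path.expanduser("~")
    conf_dir = os.path.join(home, ".cnvpytor")
    # Create the configuration directory if it does not exist yet.
    if not os.path.isdir(conf_dir):
        os.makedirs(conf_dir)
    dest = os.path.join(conf_dir, "reference_genomes_conf.py")
    # Note: this overwrites an existing reference_genomes_conf.py.
    shutil.copyfile(src, dest)
    return dest
```

Calling install_config("example_ref_genome_conf.py") from the directory that holds the example file would place it at the permanent location named above.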
2.4.1 Create GC and mask file for new reference genome
CNVpytor also has optional features for GC correction and masking (i.e., handling of commonly known false-positive regions). You can set these up for your reference genome by providing the corresponding files in the gc_file and mask_file fields of the configuration file.
To create the GC file, we need the sequence of the reference genome in a fasta.gz file:
> cnvpytor -root MGSCv37_gc_file.pytor -gc ~/hg19/mouse.fasta.gz -make_gc_file
This command will produce the file MGSCv37_gc_file.pytor, which contains information about GC content in 100-base-pair bins.
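To make "GC content in 100-base-pair bins" concrete, the sketch below counts G and C bases per bin for a toy sequence. It illustrates the idea only: gc_per_bin is a hypothetical function, and how CNVpytor computes and stores these values inside the .pytor file is not described here.

```python
def gc_per_bin(sequence, bin_size=100):
    """Return the number of G/C bases in each consecutive bin of `bin_size` bases."""
    sequence = sequence.upper()
    counts = []
    for start in range(0, len(sequence), bin_size):
        window = sequence[start:start + bin_size]
        counts.append(window.count("G") + window.count("C"))
    return counts


# Toy sequence: 100 x "G", then 100 x "A", then 50 x "C" (a shorter final bin).
toy = "G" * 100 + "A" * 100 + "C" * 50
print(gc_per_bin(toy))  # [100, 0, 50]
```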
For reference genomes that have a strict mask in the same format as the 1000 Genomes Project strict mask, we can create the mask file using the command:
> cnvpytor -root MGSCv37_mask_file.pytor -mask ~/hg19/mouse.strict_mask.whole_genome.fasta.gz -make_mask_file
If you do not have a mask file, you can skip this step. The mask file contains information about regions of the genome that are more accessible to next-generation sequencing methods using short reads. CNVpytor uses P-marked positions to filter SNPs and the read depth signal. If the reference genome configuration does not contain a mask file, CNVpytor will still be fully functional, apart from the filtering step. You may also generate your own mask file by creating a FASTA file that contains the character "P" if the corresponding base pair passes the filter and any character other than "P" if not.
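A custom mask of the kind described above could be produced along these lines. This is a sketch under stated assumptions: mask_line and write_mask_fasta are hypothetical helpers, "N" is an arbitrary choice for the non-passing character, the 60-character line width is a common FASTA convention rather than a documented requirement, and the output is uncompressed (the example command above uses a fasta.gz file, so compression may be needed afterwards).

```python
def mask_line(length, passing_intervals):
    """Return a mask string with "P" inside passing intervals and "N" elsewhere.

    passing_intervals: iterable of (start, end) pairs, 0-based, end exclusive.
    """
    mask = ["N"] * length
    for start, end in passing_intervals:
        # Clip each interval to the chromosome boundaries.
        for i in range(max(start, 0), min(end, length)):
            mask[i] = "P"
    return "".join(mask)


def write_mask_fasta(path, masks, width=60):
    """Write {chromosome name: mask string} as a FASTA file with fixed-width lines."""
    with open(path, "w") as handle:
        for name, mask in masks.items():
            handle.write(">%s\n" % name)
            for i in range(0, len(mask), width):
                handle.write(mask[i:i + width] + "\n")


print(mask_line(10, [(2, 5), (8, 12)]))  # NNPPPNNNPP
```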