Automatic characterization of copy number polymorphism using high throughput sequencing

Can Alkan

doi:10.3906/elk-1903-135

Abstract

Genome structural variation, broadly defined as alterations longer than 50 bp, is an important source of genetic variation among humans, including variants that cause complex diseases such as autism, developmental delay, and schizophrenia. Although there has been considerable progress in characterizing structural variation since the beginning of the 1000 Genomes Project, one form of structural variation, segmental duplications (SDs), has remained largely understudied in large cohorts. This is mostly because SDs cannot be accurately discovered using the alignment files generated with standard read mapping tools; they can only be found when multiple map locations are considered. Only a single algorithm is currently available for SD discovery, and it comprises various tools and scripts that are not portable and are difficult to use. Additionally, this algorithm relies on a priori information about regions where no structural variation has been discovered in a large number of genomes. Therefore, there is a need for fully automated, portable, and user-friendly tools to make SD characterization a part of genome analyses. Here we introduce such an algorithm and its efficient implementation, called mrCaNaVaR, which aims to fill this gap in the genome analysis toolbox.

Highlights

The changes in DNA sequences are classified depending on their size and organization
mrCaNaVaR uses the mrsFAST aligner that we have previously developed for tracking multiple map locations for accurate copy number variation (CNV) discovery [45]. mrCaNaVaR can take as input raw FASTQ files, or alignment files in BAM [46] or CRAM [47] format generated with any read mapper such as BWA-MEM [48] and Bowtie2 [49].
We provide BAM and CRAM file support to enable the use of mrCaNaVaR on data sets where the raw FASTQ files are deleted after alignment.

Summary

Introduction

The changes in DNA sequences are classified depending on their size and organization. The smallest form of genomic variation, called single nucleotide variation (SNV), is a single base pair substitution between two DNA sequences [1], typically called sample and reference. There can also be insertions and deletions of short sequences (1–50 bp), named indel polymorphisms [2]. Other forms of genomic variation include expansion and contraction of short tandem repeats (microsatellite polymorphisms) [3], balanced rearrangements such as inversions [4] and translocations [5], and copy number variation (CNV) [6]. CNVs, by definition, alter the amount of DNA material in cells, and they can be deletions, insertions, and duplications of genomic segments [7], as well as mobile element retrotranspositions [8]. The 1000 Genomes Project, which ran between 2008 and 2015, generated the most comprehensive map of genomic variation in the genomes of 2504 individuals from 26 populations, aiming to characterize genetic diversity within the human species [9,10,11].
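The size thresholds above can be illustrated with a minimal sketch. This is not part of mrCaNaVaR: the function name `classify_by_size` and the returned labels are hypothetical, and the sketch looks only at the length difference between a reference and a sample allele, ignoring organization (inversions, translocations, repeat context).

```python
def classify_by_size(ref: str, alt: str) -> str:
    """Label a reference/sample allele pair by size only (illustrative)."""
    if ref == alt:
        return "identical"
    # A single base pair substitution is an SNV.
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    size = abs(len(alt) - len(ref))
    if size == 0:
        # Equal-length multi-base changes are outside this simple scheme.
        return "other"
    # Short insertions and deletions (1-50 bp) are indels;
    # alterations longer than 50 bp count as structural variation.
    if size <= 50:
        return "indel"
    return "structural variant"


if __name__ == "__main__":
    print(classify_by_size("A", "G"))
    print(classify_by_size("A", "ACGT"))
    print(classify_by_size("A", "A" + "C" * 51))
```

A real classifier would also need the variant's organization (for example, whether a long gain of sequence is a tandem duplication or a mobile element insertion), which a length comparison alone cannot determine.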

Objectives

Methods

Results

Conclusion