A multi-sample based method for identifying common CNVs in normal human genomic structure using high-resolution aCGH data.

Chihyun Park, Sanghyun Park, Jaegyoon Ahn, Youngmi Yoon

doi:10.1371/journal.pone.0026975

Open Access

https://doi.org/10.1371/journal.pone.0026975

Journal: PLoS ONE
Publication Date: Oct 31, 2011
Citations: 31
License type: CC BY 4.0

Affiliation: Yonsei University, Gachon University

Abstract

Background: It is difficult to identify copy number variations (CNVs) in normal human genomic data because of noise and non-linear relationships between different genomic regions and signal intensity. A high-resolution array comparative genomic hybridization (aCGH) data set containing 42 million probes, far more than previous arrays, was recently published. Most existing CNV detection algorithms do not work well on such data, both because of the noise associated with the large amount of input and because most current methods were not designed to analyze normal human samples. Normal human genome analysis often requires a joint approach across multiple samples. However, the majority of existing methods can only identify CNVs from a single sample.

Methodology and Principal Findings: We developed a multi-sample-based genomic variations detector (MGVD) that uses segmentation to identify common breakpoints across multiple samples, combined with a k-means-based clustering strategy. Unlike previous methods, MGVD simultaneously considers multiple samples with different genomic intensities and identifies CNVs and CNV zones (CNVZs); a CNVZ is a more precise measure of the location of a genomic variant than a CNV region (CNVR).

Conclusions and Significance: We designed a specialized algorithm to detect common CNVs from extremely high-resolution multi-sample aCGH data. MGVD showed high sensitivity and a low false discovery rate on a simulated data set, and outperformed most current methods when real, high-resolution HapMap data sets were analyzed. MGVD also had the fastest runtime of the algorithms evaluated on actual high-resolution aCGH data. The CNVZs identified by MGVD can be used in association studies to reveal relationships between phenotypes and genomic aberrations. Our algorithm was implemented in standard C++ using the STL and is available for Linux and MS Windows. It is freely available at: http://embio.yonsei.ac.kr/~Park/mgvd.php.
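To make the idea of k-means-based clustering across samples concrete, the following is a minimal sketch and not the MGVD implementation. It assumes that a segment between two common breakpoints has already been found and that each sample is summarized by its mean log2 ratio over that segment. The function name `kmeans_1d`, the example values, and the choice of three clusters (loss, neutral, gain) are illustrative assumptions; the initial centers are the theoretical log2 ratios for one, two, and three copies in a diploid genome.

```python
def kmeans_1d(values, centers, max_iter=100):
    """Lloyd's k-means in one dimension with caller-supplied initial centers."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            for v in values
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = sum(members) / len(members)
    return labels, centers


# Hypothetical mean log2 ratios of one segment in eight samples.
segment_means = [-0.95, -1.05, 0.02, -0.03, 0.05, 0.0, 0.55, 0.62]

# Assumed starting points: log2(1/2), log2(2/2), log2(3/2).
labels, centers = kmeans_1d(segment_means, centers=[-1.0, 0.0, 0.585])

states = ["loss", "neutral", "gain"]
calls = [states[lab] for lab in labels]
print(calls)
```

Clustering all samples together in this way lets samples with similar intensity over the same segment share a copy number state, which is the kind of joint, multi-sample view the abstract contrasts with single-sample methods.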

Highlights

Copy number variations (CNVs) are a type of human genomic structural variation
The multi-sample-based genomic variations detector (MGVD) had the fastest runtime of the algorithms evaluated on actual high-resolution array comparative genomic hybridization (aCGH) data
The CNV zones (CNVZs) identified by MGVD can be used in association studies to reveal relationships between phenotypes and genomic aberrations

Summary

Introduction

Copy number variations (CNVs) are a type of human genomic structural variation. CNVs are recognized as a major source of human genetic variability, occupying a larger proportion of the genome than single nucleotide polymorphisms (SNPs) [1]. The mechanisms and medical relevance of CNVs in the human genome are not yet fully understood; a recent study focused on the relationships between CNVs and genes as well as between SNPs and genes [5]. Identifying CNVs in normal human genomic data is difficult because of noise and non-linear relationships between different genomic regions and signal intensity. Most existing CNV detection algorithms do not work well, both because of the noise associated with the large amount of input data and because most current methods were not designed to analyze normal human samples. In addition, the majority of existing methods can only identify CNVs from a single sample.

Methods

Results

Discussion

Conclusion