Abstract

Comprehensive identification and cataloging of copy number variations (CNVs) are essential to providing a complete view of human genetic variation and to finding disease genes. Because large-scale whole-genome sequencing (WGS) must keep costs under control, low-coverage data are often favored for CNV identification. However, such low-coverage data are sensitive to noise and sequencing biases, which has resulted in low-resolution CNV detection in previous experimental designs for WGS datasets. In this paper, we present a control-free approach based on a Dirichlet process Gaussian mixture model (dpGMM) that analyzes the read depth (RD) of low-coverage WGS datasets for CNV discovery. First, noise and biases in the RD signals are corrected in the preprocessing step of dpGMM. We then assume that RD signals across genomic regions follow a Gaussian mixture model (GMM) in which each Gaussian distribution corresponds to a copy number state. Rather than requiring the number of Gaussian distributions to be specified in advance, dpGMM builds a Dirichlet process (DP) GMM for the RD signals and uses a DP prior to infer the number of Gaussian components. We apply dpGMM to simulation datasets with different coverages and to individual datasets, and compare it with three widely used RD-based pipelines: CNVnator, GROM-RD, and BIC-seq2. Simulation results demonstrate that dpGMM achieves a high F1 score on both low- and high-coverage sequences. In real data, the number of CNVs detected by dpGMM that overlap the standard benchmark is twice the number detected by other tools such as CNVnator and GROM-RD.

Highlights

  • Copy number variations (CNVs), as an important form of structural variations, have gained considerable interest in genetic and functional analysis of human genome variation

  • The key idea of the Dirichlet process Gaussian mixture model (dpGMM) method is to model read depth (RD) signals with a Dirichlet process (DP) Gaussian mixture model (GMM), using a DP as the prior on the mixture instead of fixing the number of Gaussian components in advance

  • The effective number of Gaussian components can be inferred from the RD signal data, which allows us to recognize specific copy number states for RD signals of genomic regions
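
The idea of letting the data decide how many components exist can be illustrated with a minimal sketch. It does not reproduce the dpGMM inference described here. It uses DP-means, a hard-assignment, small-variance simplification of a DP Gaussian mixture, chosen because it is deterministic and needs only the standard library. The function name `dp_means`, the penalty `lam`, and the toy RD values are all hypothetical, and the values are assumed to be already normalized so that two copies correspond to roughly 2.0.

```python
def dp_means(values, lam, max_iter=100):
    """Cluster 1-D values with DP-means (not the paper's dpGMM inference).

    lam is the squared-distance penalty: a point farther than lam (squared)
    from every existing center opens a new cluster, so the number of
    clusters is inferred from the data instead of being fixed in advance.
    """
    centers = [sum(values) / len(values)]  # start with one global cluster
    labels = [0] * len(values)
    for _ in range(max_iter):
        new_labels = []
        for x in values:
            dists = [(x - c) ** 2 for c in centers]
            k = min(range(len(centers)), key=dists.__getitem__)
            if dists[k] > lam:
                # Too far from every center: open a new cluster at x.
                centers.append(x)
                k = len(centers) - 1
            new_labels.append(k)
        # Recompute centers from members, dropping clusters left empty.
        members = {}
        for x, k in zip(values, new_labels):
            members.setdefault(k, []).append(x)
        kept = sorted(members)
        remap = {old: new for new, old in enumerate(kept)}
        centers = [sum(members[k]) / len(members[k]) for k in kept]
        new_labels = [remap[k] for k in new_labels]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return centers, labels


# Hypothetical normalized RD values per genomic bin (2.0 ~ two copies).
rd = [2.0, 2.1, 1.9, 2.0, 1.0, 1.1, 0.9, 3.0, 3.1, 2.9, 2.0, 2.1]
centers, labels = dp_means(rd, lam=0.25)

# Assumption: each cluster center is read as the nearest integer copy number.
copy_states = [round(c) for c in centers]
print(len(centers), copy_states)  # -> 3 [2, 1, 3]
```

On this toy input three clusters emerge without the count being supplied, and each bin's label points to a copy number state. A full DP GMM would additionally keep soft assignments and per-component variances, which this simplification discards.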


Summary

Introduction

Copy number variations (CNVs), as an important form of structural variation, have gained considerable interest in genetic and functional analysis of human genome variation. Several large-scale studies have reported that CNVs contribute to phenotypic variation and adaptation by disrupting genes and altering gene dosage [1], [2]. Whole-genome sequencing (WGS) data contribute significantly to research on human diversity and disease. The rapid development of next-generation sequencing (NGS) technology has provided an unprecedented opportunity for genome-wide analysis of CNVs. To control costs, low-coverage data are often favored in genome-wide variation analysis. However, read depth (RD) signals from low-coverage data are sensitive to systematic noise and sequencing biases, which may cause RD-based methods to make false CNV calls.

The associate editor coordinating the review of this manuscript and approving it for publication was Liangtian Wan.

