A randomized approach to speed up the analysis of large-scale read-count data in the application of CNV detection

Weibo Wang, Jin Szatkiewicz, Wei Sun, Wei Wang

doi:10.1186/s12859-018-2077-6


Open Access



Abstract

Background: The application of high-throughput sequencing in a broad range of quantitative genomic assays (e.g., DNA-seq, ChIP-seq) has created a high demand for the analysis of large-scale read-count data. Typically, the genome is divided into tiling windows, and windowed read-count data are generated for the entire genome, from which genomic signals are detected (e.g., copy number changes in DNA-seq, enrichment peaks in ChIP-seq). For accurate analysis of read-count data, many state-of-the-art statistical methods use generalized linear models (GLM) coupled with the negative-binomial (NB) distribution, leveraging its ability to perform bias correction and signal detection simultaneously. However, although statistically powerful, the GLM+NB method has quadratic computational complexity and therefore runs slowly when applied to large-scale windowed read-count data. In this study, we aimed to substantially speed up the GLM+NB method by using a randomized algorithm, and we demonstrate the utility of our approach in detecting copy number variants (CNVs) using a real example.

Results: We propose an efficient estimator, the randomized GLM+NB coefficients estimator (RGE), for speeding up the GLM+NB method. RGE samples the read-count data and solves the estimation problem on a smaller scale. We first theoretically validated the consistency and the variance properties of RGE. We then applied RGE to GENSENG, a GLM+NB based method for detecting CNVs, and named the resulting method "R-GENSENG". Based on extensive evaluation using both simulated and empirical data, we concluded that R-GENSENG is ten times faster than the original GENSENG while maintaining GENSENG's accuracy in CNV detection.

Conclusions: Our results suggest that the RGE strategy developed here could be applied to other GLM+NB based read-count analyses, e.g., ChIP-seq data analysis, to substantially improve their computational efficiency while preserving the analytic power.
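The idea of sampling the read-count data and solving the estimation problem on a smaller scale can be illustrated with a minimal sketch. This is not the authors' RGE implementation: the uniform sampling without replacement, the single hypothetical covariate `x`, the fixed (known) dispersion `ALPHA`, the simulated data, and every function name below are assumptions made for illustration. The fit uses standard iteratively reweighted least squares (IRLS) for an NB regression with log link.

```python
import math
import random

def poisson(rng, lam):
    """Knuth's multiplication method; adequate for the moderate means used here."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def simulate_windows(rng, n, b0, b1, alpha):
    """Simulate NB counts (gamma-Poisson mixture) with mean exp(b0 + b1 * x)."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)  # hypothetical standardized covariate
        mu = math.exp(b0 + b1 * x)
        # Gamma multiplier with mean 1 and variance alpha gives Var(y) = mu + alpha * mu^2.
        lam = mu * rng.gammavariate(1.0 / alpha, alpha)
        xs.append(x)
        ys.append(poisson(rng, lam))
    return xs, ys

def fit_nb_glm(xs, ys, alpha, iters=25):
    """IRLS for log(mu) = b0 + b1 * x under an NB model with fixed dispersion alpha."""
    b0, b1 = math.log(sum(ys) / len(ys)), 0.0
    for _ in range(iters):
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for x, y in zip(xs, ys):
            eta = b0 + b1 * x
            mu = math.exp(eta)
            w = mu / (1.0 + alpha * mu)   # IRLS weight for NB with log link
            z = eta + (y - mu) / mu       # working response
            s_w += w
            s_wx += w * x
            s_wxx += w * x * x
            s_wz += w * z
            s_wxz += w * x * z
        # Solve the 2x2 weighted normal equations in closed form.
        det = s_w * s_wxx - s_wx ** 2
        b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        b1 = (s_w * s_wxz - s_wx * s_wz) / det
    return b0, b1

rng = random.Random(0)
TRUE_B0, TRUE_B1, ALPHA = 3.0, 0.5, 0.1
xs, ys = simulate_windows(rng, 20000, TRUE_B0, TRUE_B1, ALPHA)

# Fit on all windows, then on a uniform random subsample of 10% of them.
full_fit = fit_nb_glm(xs, ys, ALPHA)
idx = rng.sample(range(len(xs)), 2000)
sub_fit = fit_nb_glm([xs[i] for i in idx], [ys[i] for i in idx], ALPHA)

print("full data :", full_fit)
print("subsample :", sub_fit)
```

In this toy setting each IRLS pass over the subsample touches one tenth of the windows, and the subsample estimates should land close to the full-data estimates, with somewhat larger variance. That trade-off is the one the consistency and variance analysis of RGE addresses.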

Highlights

The application of high-throughput sequencing in a broad range of quantitative genomic assays (e.g., DNA-seq, ChIP-seq) has created a high demand for the analysis of large-scale read-count data.
Our results suggest that the randomized GLM+NB coefficients estimator (RGE) and the strategy developed in this work could be applied to other read-count analyses based on generalized linear models (GLM) with the negative-binomial (NB) distribution, substantially improving their computational efficiency while preserving the analytic power.
As an example application, we use RGE to speed up GENSENG, a GLM+NB based method for detecting copy-number variants (CNVs) from read-count data of germline samples.

Summary

Introduction

The application of high-throughput sequencing in a broad range of quantitative genomic assays (e.g., DNA-seq, ChIP-seq) has created a high demand for the analysis of large-scale read-count data. The genome is divided into tiling windows, and windowed read-count data are generated for the entire genome, from which genomic signals are detected (e.g., copy number changes in DNA-seq, enrichment peaks in ChIP-seq). While statistically powerful, GLM+NB methods encounter a big data problem [18] when applied to whole-genome windowed read-count data with tens of millions of windows. Such applications include detecting CNVs from whole-genome DNA-seq data [8, 10], detecting enrichment peaks from whole-genome ChIP-seq data [19], and finding associations between histone modification or open chromatin and DNA sequence content [20].
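As a small illustration of how windowed read-count data can be produced, the sketch below bins read start positions into fixed-size, non-overlapping tiling windows. The function name, the 0-based coordinates, the toy positions, and the window size are all assumptions for illustration; real pipelines work from aligned reads and usually apply mapping-quality filters first.

```python
def windowed_read_counts(read_starts, chrom_length, window_size):
    """Count reads per tiling window, using 0-based read start positions."""
    n_windows = -(-chrom_length // window_size)  # ceiling division
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < chrom_length:              # ignore out-of-range positions
            counts[pos // window_size] += 1
    return counts

# Toy example: 7 reads on a 1,000 bp chromosome with 100 bp windows.
read_starts = [5, 17, 102, 150, 199, 250, 999]
counts = windowed_read_counts(read_starts, chrom_length=1000, window_size=100)
print(counts)
```

At whole-genome scale with small windows, the resulting vector has tens of millions of entries, which is the input size at which GLM+NB fitting becomes the bottleneck.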
