SLMSuite: a suite of algorithms for segmenting genomic profiles

Valerio Orlandini,Alberto Magi,Sabrina Giglio,Aldesia Provenzano

doi:10.1186/s12859-017-1734-5

Valerio Orlandini, Alberto Magi + Show 2 more

Open Access

https://doi.org/10.1186/s12859-017-1734-5

Copy DOI

Abstract

BackgroundThe identification of copy number variants (CNVs) is essential to study human genetic variation and to understand the genetic basis of mendelian disorders and cancers. At present, genome-wide detection of CNVs can be achieved using microarray or second generation sequencing (SGS) data. Although these technologies are very different, the genomic profiles that they generate are mathematically very similar and consist of noisy signals in which a decrease or increase of consecutive data represent deletions or duplication of DNA. In this framework, the most important step of the analysis consists of segmenting genomic profiles for the identification of the boundaries of genomic regions with increased or decreased signal.ResultsHere we introduce SLMSuite, a collection of algorithms, based on shifting level models (SLM), to segment genomic profiles from array and SGS experiments. The SLM algorithms take as input the log-transformed genomic profiles from SGS or microarray experiments and output segmentation results. We apply our method to the analysis of synthetic genomic profiles and real whole genome sequencing data and we demonstrate that it outperforms the state of the art circular binary segmentation algorithm in terms of sensitivity, specificity and computational speed.ConclusionThe SLMSuite contains an R library with the segmentation methods and three wrappers that allow to use them in Python, Ruby and C++. SLMSuite is freely available at https://sourceforge.net/projects/slmsuite.

Highlights

The identification of copy number variants (CNVs) is essential to study human genetic variation and to understand the genetic basis of mendelian disorders and cancers
In the last few years we developed a class of algorithms, based on shifting level models (SLM), that allow to segment with high accuracy genomic profiles
The first SLM algorithm [5] was developed for analyzing log2-ratio data from CGH-array, the multivariate version, JointSLM [6] was written for the joint segmentation of multiple Read count (RC) signals, while the heterogeneous version, heterogeneous shifting levels model (HSLM) [7] was properly tailored for segmenting spatially sparse data from whole-exome sequencing (WES) experiments

Summary

Results

For datasets made of large number of windows (≥ 50000) SLM was able to segment genomic profiles in less than 10 seconds while CBS scaled up in the order of minutes This result is of great relevance for the analysis of high coverage whole genome sequencing data with small window size (100 bp) that generate genomic profiles up to 2.5 millions of RC data points. Since the capability of detecting genomic regions involved in CNVs is influenced by the length of the event, we distinguished three classes of variants: Small (length < 20Kb), Medium (length ≥ 20Kb and < 100Kb) and Large (length ≥ 100Kb)

Background

Conclusion