Sequana coverage: detection and characterization of genomic variations using running median and mixture models.

Dimitri Desvillechabrol, Christiane Bouchier, Sean Kennedy, Thomas Cokelaer

doi:10.1093/gigascience/giy110

Open Access

Journal: GigaScience
Publication Date: Sep 6, 2018
Citations: 12
License type: CC BY 4.0

Affiliation: Institut Pasteur

Abstract

Background: In addition to mapping quality information, genome coverage contains valuable biological information such as the presence of repetitive regions, deleted genes, or copy number variations (CNVs). It is essential to take into consideration atypical regions, trends (e.g., origin of replication), and known or unknown biases that influence coverage. It is also important that reported events have robust statistics (e.g., a z-score) associated with their detection, as well as a precise location.

Results: We provide a stand-alone application, sequana_coverage, that reports genomic regions of interest (ROIs) that are significantly over- or underrepresented in high-throughput sequencing data. Each event is reported with its significance and with characteristics such as the length of the region. The algorithm first detrends the data using an efficient running median algorithm. It then estimates the distribution of the normalized genome coverage with a Gaussian mixture model. Finally, a z-score statistic is assigned to each base position and used to separate the central distribution from the ROIs (i.e., under- and overcovered regions). A double-threshold mechanism is used to cluster the genomic ROIs. HTML reports provide a summary with interactive visual representations of the genomic ROIs, together with standard plots and metrics. Genomic variations such as single-nucleotide variants or CNVs can be effectively identified at the same time.
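The double-threshold clustering step can be illustrated with a minimal Python sketch. This is not the sequana_coverage implementation: the function name `cluster_rois` and the threshold values (4 for seeding an event, 2 for extending it) are assumptions chosen for illustration. A run of consecutive positions whose z-scores stay beyond the low threshold, on the same side of zero, is reported as one ROI only if at least one position in the run also passes the high threshold.

```python
def cluster_rois(zscores, high=4.0, low=2.0):
    """Group per-base z-scores into ROIs using two thresholds.

    Returns a list of (start, end, peak_z) tuples, with `end` exclusive.
    `high` and `low` are illustrative values, not the tool's settings.
    """
    rois = []
    n = len(zscores)
    i = 0
    while i < n:
        z = zscores[i]
        if abs(z) < low:
            i += 1
            continue
        # Start of a candidate run; remember whether it is over- or undercovered.
        sign = 1 if z > 0 else -1
        start = i
        peak = 0.0
        # Extend the run while positions stay beyond the low threshold.
        while i < n and sign * zscores[i] >= low:
            peak = max(peak, sign * zscores[i])
            i += 1
        # Keep the run only if it was seeded by a strong (high-threshold) event.
        if peak >= high:
            rois.append((start, i, sign * peak))
    return rois


example = [0.1, 2.5, 4.5, 3.0, 1.0, 2.5, 2.2, 0.0, -2.1, -5.0, -2.5, 0.3]
rois = cluster_rois(example)
```

In this example the run at positions 5 to 6 passes the low threshold but never the high one, so it is discarded, while the two runs containing a strong event are kept together with their moderately deviating neighbours.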

Highlights

In addition to mapping quality information, genome coverage contains valuable biological information such as the presence of repetitive regions, deleted genes, or copy number variations (CNVs).
We provide an implementation within the Sequana project [34], a Python library that provides high-throughput sequencing (HTS) pipelines based on the Snakemake workflow management system [35] (Makefile-like, with a Python syntax).
We compared the results provided in the supplementary data of [16] with those obtained by running sequana_coverage and CNVnator.

Summary

Introduction

In addition to mapping quality information, genome coverage contains valuable biological information such as the presence of repetitive regions, deleted genes, or copy number variations (CNVs). We provide a stand-alone application, sequana_coverage, that reports genomic regions of interest (ROIs) that are significantly over- or underrepresented in high-throughput sequencing data. The algorithm first detrends the data using an efficient running median algorithm. It then estimates the distribution of the normalized genome coverage with a Gaussian mixture model. HTML reports provide a summary with interactive visual representations of the genomic ROIs, together with standard plots and metrics. Genomic variations such as single-nucleotide variants or CNVs can be effectively identified at the same time.

The emergence of second-generation sequencing, also known as next-generation sequencing (NGS hereafter), has dramatically reduced the cost of sequencing. This breakthrough multiplied the number of genomic analyses undertaken by research laboratories and yielded vast amounts of data. Read lengths range from 35 to 300 bases for current short-read approaches [1] to several tens of thousands of bases with long-read technologies such as Pacific Biosciences [5, 6] or Oxford Nanopore [7].
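The detrending and mixture-model steps described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the Sequana code: the helper names (`running_median`, `fit_gmm2`, `coverage_zscores`), the two-component mixture, the EM initialization, the window size, and the synthetic coverage are all hypothetical choices. The coverage is divided by its running median, a two-component Gaussian mixture is fitted by expectation-maximization, and z-scores are computed against the component with the largest weight, taken here as the central distribution.

```python
import math
import random
from bisect import bisect_left, insort


def running_median(values, window):
    """Centered running median (odd `window`); the window shrinks at the edges."""
    half = window // 2
    n = len(values)
    win = sorted(values[:half])  # sorted copy of the current window
    out = []
    for i in range(n):
        if i + half < n:
            insort(win, values[i + half])  # value entering on the right
        if i - half - 1 >= 0:
            del win[bisect_left(win, values[i - half - 1])]  # value leaving on the left
        out.append(win[len(win) // 2])  # upper middle if the window length is even
    return out


def fit_gmm2(x, iters=50):
    """EM fit of a 2-component 1-D Gaussian mixture: [(weight, mu, sigma), ...]."""
    mean = sum(x) / len(x)
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x)) or 1e-6
    comps = [(0.5, mean - sd, sd), (0.5, mean + sd, sd)]  # assumed initialization
    for _ in range(iters):
        # E-step: responsibility of each component for each value.
        resp = []
        for v in x:
            p = [w * math.exp(-0.5 * ((v - m) / s) ** 2) / s for w, m, s in comps]
            tot = sum(p) or 1e-300
            resp.append([pk / tot for pk in p])
        # M-step: re-estimate weight, mean and standard deviation.
        comps = []
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-300
            mu = sum(r[k] * v for r, v in zip(resp, x)) / nk
            var = sum(r[k] * (v - mu) ** 2 for r, v in zip(resp, x)) / nk
            comps.append((nk / len(x), mu, max(math.sqrt(var), 1e-3)))
    return comps


def coverage_zscores(coverage, window):
    """Detrend with a running median, then score against the central component."""
    trend = running_median(coverage, window)
    norm = [c / t if t > 0 else 0.0 for c, t in zip(coverage, trend)]
    _, mu, sigma = max(fit_gmm2(norm))  # component with the largest weight
    return [(v - mu) / sigma for v in norm]


# Synthetic coverage: slow linear trend, noise, and one duplicated region.
rng = random.Random(0)
coverage = [100 + 0.02 * i + rng.gauss(0, 5) for i in range(2000)]
for i in range(1000, 1020):
    coverage[i] *= 2
z = coverage_zscores(coverage, window=501)
```

Because the running median follows the slow trend but is barely affected by the short duplicated region, positions 1000 to 1019 stand out with large positive z-scores while the trend itself does not.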