Customisation of the Exome Data Analysis Pipeline Using a Combinatorial Approach

Swetansu Pattnaik,Srividya Vaidyanathan,Sa Deepak,Durgad G Pooja,Binay Panda,Paolo Provero

doi:10.1371/journal.pone.0030080

Abstract

The advent of next generation sequencing (NGS) technologies have revolutionised the way biologists produce, analyse and interpret data. Although NGS platforms provide a cost-effective way to discover genome-wide variants from a single experiment, variants discovered by NGS need follow up validation due to the high error rates associated with various sequencing chemistries. Recently, whole exome sequencing has been proposed as an affordable option compared to whole genome runs but it still requires follow up validation of all the novel exomic variants. Customarily, a consensus approach is used to overcome the systematic errors inherent to the sequencing technology, alignment and post alignment variant detection algorithms. However, the aforementioned approach warrants the use of multiple sequencing chemistry, multiple alignment tools, multiple variant callers which may not be viable in terms of time and money for individual investigators with limited informatics know-how. Biologists often lack the requisite training to deal with the huge amount of data produced by NGS runs and face difficulty in choosing from the list of freely available analytical tools for NGS data analysis. Hence, there is a need to customise the NGS data analysis pipeline to preferentially retain true variants by minimising the incidence of false positives and make the choice of right analytical tools easier. To this end, we have sampled different freely available tools used at the alignment and post alignment stage suggesting the use of the most suitable combination determined by a simple framework of pre-existing metrics to create significant datasets.

Highlights

DNA Sequencing has come a long way since its first discovered more than 30 years back, in terms of speed, throughput and cost
As amount of data generated from the generation sequencing (NGS) platforms is very large that requires sophisticated informatics tools and skills in computational biology to mine, analyse and interpret the data, bulk of the research in the field has come from a handful of large genome centres that employ large numbers of computer scientists, computational biologists and bioinformatics specialists
Most of the aligners are challenged by the above limitation wherein the algorithms tend to lose true positives due to under mapping of reads with inexact matches and allele specific mapping bias

Summary

Introduction

DNA Sequencing has come a long way since its first discovered more than 30 years back, in terms of speed, throughput and cost. In order to make NGS technology ubiquitous and clinically useful, one needs to come up with simplified analysis tools that produce more true positive calls and reduces efforts and money required for downstream validation experiments. Calculation of alignment maps without increase in computer hardware requirements for high throughput data has been the central theme for optimization of most alignment algorithms necessitating the use of approximate heuristic methods. These approximations have definitely achieved speed gains by accommodating low-quality alignments in varying degrees. The speed limits of these algorithms will be challenged more seriously as the sequence capacity grows and will further test the balance between speed and accuracy of these processes

Objectives

Methods

Results

Conclusion