NGS-QCbox and Raspberry for Parallel, Automated and Rapid Quality Control Analysis of Large-Scale Next Generation Sequencing (Illumina) Data.

Mohan A V S K Katta,Mahendar Thudi,Rajeev K Varshney,Dadakhalandar Doddamani,Aamir W Khan,Junwen Wang

doi:10.1371/journal.pone.0139868

Abstract

The rapid popularity and adoption of next generation sequencing (NGS) approaches have generated huge volumes of data. High-throughput platforms like Illumina HiSeq produce terabytes of raw data that require quick processing. Quality control of the data is an important step prior to downstream analyses. To address these issues, we have developed a quality control pipeline, NGS-QCbox, that scales up to process hundreds or thousands of samples. Raspberry is an in-house tool, developed in C utilizing HTSlib (v1.2.1) (http://htslib.org), for computing read/base level statistics. It can be used as a stand-alone application and can process both compressed and uncompressed FASTQ format files. NGS-QCbox integrates Raspberry with other open-source tools for alignment (Bowtie2), SNP calling (SAMtools) and other utilities (bedtools) to analyze raw NGS data efficiently and in a high-throughput manner. The pipeline implements batch processing of jobs in parallel using Bpipe (https://github.com/ssadedin/bpipe) and, internally, fine-grained task parallelization using OpenMP. It reports read and base statistics along with genome coverage and variants in a user-friendly format. The pipeline presents a simple menu-driven interface and can be used in either quick or complete mode. In addition, the pipeline in quick mode is faster than other similar existing QC pipelines/tools. The NGS-QCbox pipeline, Raspberry tool and associated scripts are available at https://github.com/CEG-ICRISAT/NGS-QCbox and https://github.com/CEG-ICRISAT/Raspberry for rapid quality control analysis of large-scale next generation sequencing (Illumina) data.

Highlights

Next generation sequencing (NGS) technologies generate large volumes of data and have proven cost effective compared with conventional sequencing methods
The data generated from Illumina sequencing machines is in binary format
To assess the quality of the data generated, we have developed an in-house tool called Raspberry (v0.3) in C, utilizing HTSlib (v1.2), to compute read/base level metrics
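To make "read/base level metrics" concrete, the following is a minimal Python sketch of the kind of statistics such a tool might report for a FASTQ file, whether gzip-compressed or uncompressed. It is not Raspberry's implementation (Raspberry is written in C on top of HTSlib). The function names, the choice of metrics, the Phred+33 offset and the Q20 threshold are all assumptions made for illustration, and the parser assumes simple four-line FASTQ records.

```python
import gzip


def open_fastq(path):
    """Open a FASTQ file as text, detecting gzip by its two-byte magic number."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def fastq_stats(path, phred_offset=33, q_threshold=20):
    """Compute simple read/base level statistics.

    Assumptions (illustrative, not taken from the paper): four-line FASTQ
    records, Phred+33 quality encoding, and Q20 as the "high quality" cutoff.
    """
    reads = bases = gc_bases = n_bases = q_bases = 0
    with open_fastq(path) as fh:
        while True:
            header = fh.readline()
            if not header:
                break                      # end of file
            if not header.strip():
                continue                   # skip stray blank lines
            seq = fh.readline().strip().upper()
            fh.readline()                  # '+' separator line
            qual = fh.readline().strip()
            reads += 1
            bases += len(seq)
            gc_bases += seq.count("G") + seq.count("C")
            n_bases += seq.count("N")
            q_bases += sum(1 for ch in qual if ord(ch) - phred_offset >= q_threshold)
    return {
        "reads": reads,
        "bases": bases,
        "mean_read_length": bases / reads if reads else 0.0,
        "gc_bases": gc_bases,
        "n_bases": n_bases,
        "q20_bases": q_bases,
    }
```

Calling `fastq_stats("sample_R1.fastq.gz")` (a hypothetical file name) returns a dictionary of counts from which percentages such as GC content or the fraction of Q20 bases can be derived.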

Summary

Introduction

Next generation sequencing (NGS) technologies generate large volumes of data and have proven cost effective compared with conventional sequencing methods. There is a pressing need for tools that can scale up to process thousands of samples simultaneously in a short time. In this context, quality control (QC) of raw, large-scale NGS data demands automation. Quality control tools/pipelines like NGS QC Toolkit [8] (http://59.163.192.90:8080/ngsqctoolkit) and the Python (http://www.python.org) based HTSeq [9] (http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html) were developed to address these constraints but are slow. These pipelines/tools are meant to work on datasets serially, which can be daunting for the end user when dealing with large datasets. NGS-QCbox aims to be a decision-making tool that helps the scientist judge whether data of sufficient quality have been generated, with the coverage the experiment demands.
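The coverage decision described above can be illustrated with a small Python sketch. It uses the standard raw-yield estimate, expected depth = total sequenced bases / genome size, which is simpler than the alignment-based genome coverage the pipeline reports. The function names, sample names, genome size and target depth below are hypothetical values chosen only for illustration.

```python
def estimated_coverage(total_bases, genome_size):
    """Expected sequencing depth from raw yield: total bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def samples_below_target(sample_bases, genome_size, target_coverage):
    """Return, sorted by name, the samples whose estimated depth is below target."""
    return sorted(
        name
        for name, bases in sample_bases.items()
        if estimated_coverage(bases, genome_size) < target_coverage
    )


# Hypothetical yields (in bases) for two samples and a toy 1 Mb genome.
yields = {"sample_A": 10_000_000, "sample_B": 4_000_000}
needs_more_data = samples_below_target(yields, genome_size=1_000_000, target_coverage=5)
```

Here `needs_more_data` lists the samples that would need additional sequencing to reach the target depth. In practice the per-sample base counts would come from the QC statistics, and the target depends on the experiment.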

Results and Discussion

Conclusions

Materials and Methods