Efficient digest of high-throughput sequencing data in a reproducible report

Zhe Zhang,Hongbo M Xie,Patrick V Warren,Juan C Perin,Angela M Yu,Ariella Sasson,Mahdi Sarmady,Jeremy Leipzig,Peter S White

doi:10.1186/1471-2105-14-s11-s3

Zhe Zhang, Hongbo M Xie + Show 7 more

Open Access

https://doi.org/10.1186/1471-2105-14-s11-s3

Copy DOI

Abstract

BackgroundHigh-throughput sequencing (HTS) technologies are spearheading the accelerated development of biomedical research. Processing and summarizing the large amount of data generated by HTS presents a non-trivial challenge to bioinformatics. A commonly adopted standard is to store sequencing reads aligned to a reference genome in SAM (Sequence Alignment/Map) or BAM (Binary Alignment/Map) files. Quality control of SAM/BAM files is a critical checkpoint before downstream analysis. The goal of the current project is to facilitate and standardize this process.ResultsWe developed bamchop, a robust program to efficiently summarize key statistical metrics of HTS data stored in BAM files, and to visually present the results in a formatted report. The report documents information about various aspects of HTS data, such as sequencing quality, mapping to a reference genome, sequencing coverage, and base frequency. Bamchop uses the R language and Bioconductor packages to calculate statistical matrices and the Sweave utility and associated LaTeX markup for documentation. Bamchop's efficiency and robustness were tested on BAM files generated by local sequencing facilities and the 1000 Genomes Project. Source code, instruction and example reports of bamchop are freely available from https://github.com/CBMi-BiG/bamchop.ConclusionsBamchop enables biomedical researchers to quickly and rigorously evaluate HTS data by providing a convenient synopsis and user-friendly reports.

Highlights

High-throughput sequencing (HTS) technologies are spearheading the accelerated development of biomedical research
One of the most successful efforts to standardize HTS workflow was the development of the Sequence Alignment/Map (SAM) format for the storage of aligned sequencing reads, along with a corresponding set of utility programs operating on SAM files [11]
SAM and its more practically utilized binary companion, BAM (Binary Alignment/Map), have been generally accepted as a standard to store and exchange aligned reads by the genomics community, including sequencing facilities and large-scale HTS projects such as the 1000 Genomes [12] and ENCODE [13] projects

Summary

Introduction

High-throughput sequencing (HTS) technologies are spearheading the accelerated development of biomedical research. The development of high-throughput sequencing (HTS) technologies has lead to major biomedical discoveries in recent years [1,2,3] The power of these technologies comes from the repeated sequencing of genomic regions of interest, such as exons [4] and protein binding sites [5], and requires processing millions of sequencing reads contained within raw data files sized between several hundred megabytes to over twenty-five gigabytes [6]. SAM and its more practically utilized binary companion, BAM (Binary Alignment/Map), have been generally accepted as a standard to store and exchange aligned reads by the genomics community, including sequencing facilities and large-scale HTS projects such as the 1000 Genomes [12] and ENCODE [13] projects. BAM files can be further sorted and indexed to support random access to reads mapped to any genomic location

Objectives

Results

Conclusion