Abstract

BackgroundThe advances in high-throughput sequencing technologies and growth in data sizes has highlighted the need for scalable tools to perform quality assurance testing. These tests are necessary to ensure that data is of a minimum necessary standard for use in downstream analysis. In this paper we present the SAMQA tool to rapidly and robustly identify errors in population-scale sequence data.ResultsSAMQA has been used on samples from three separate sets of cancer genome data from The Cancer Genome Atlas (TCGA) project. Using technical standards provided by the SAM specification and biological standards defined by researchers, we have classified errors in these sequence data sets relative to individual reads within a sample. Due to an observed linearithmic speedup through the use of a high-performance computing (HPC) framework for the majority of tasks, poor quality data was identified prior to secondary analysis in significantly less time on the HPC framework than the same data run using alternative parallelization strategies on a single server.ConclusionsThe SAMQA toolset validates a minimum set of data quality standards across whole-genome and exome sequences. It is tuned to run on a high-performance computational framework, enabling QA across hundreds gigabytes of samples regardless of coverage or sample type.

Highlights

  • The advances in high-throughput sequencing technologies and growth in data sizes has highlighted the need for scalable tools to perform quality assurance testing

  • The SAMQA toolkit was developed to support work being undertaken at the Center for Systems Analysis of the Cancer Regulome, which is one of the The Cancer Genome Atlas (TCGA) Genome Data Analysis Centers

  • SAMQA is a QA analysis toolkit that runs a series of tests over sequenced read data and is optimized for large numbers of files

Read more

Summary

Results

The SAMQA toolkit was developed to support work being undertaken at the Center for Systems Analysis of the Cancer Regulome, which is one of the TCGA Genome Data Analysis Centers. In a recent QA run on COAD/READ (Colon/ Rectal Adenocarcinoma) samples the tool was used to analyze 324 exome and 42 full genome samples. The results of the technical tests are summarized, and the results of the biological tests are shown in Figures 3 and 4 (SAMQA output shown in additional file 1). The tool automatically rejected those samples that failed the technical tests (e.g. six samples that contained only unmapped reads). The tests themselves are output as a single file, and can be read directly into an analysis program. The supplementary materials contains the output for the default tests that have been run across both the COAD/ READ samples, as well as Glioblastoma (GBM) and Ovarian (OV) cancer samples. The SAMQA toolset consists of eleven different technical and biologically relevant tests run over each BAM file.

Conclusions
Background
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call