SAMQA: error classification and validation of high-throughput sequenced read data

Thomas Robinson, Sarah Killcoyne, John Boyle, Ryan Bressler

doi:10.1186/1471-2164-12-419

Open Access

https://doi.org/10.1186/1471-2164-12-419

Journal: BMC Genomics	Publication Date: Aug 18, 2011
Citations: 23	License type: cc-by

Affiliation: Institute for Systems Biology, Seattle University

Abstract

Background: Advances in high-throughput sequencing technologies and growth in data sizes have highlighted the need for scalable tools to perform quality assurance testing. These tests are necessary to ensure that data is of a minimum necessary standard for use in downstream analysis. In this paper we present the SAMQA tool to rapidly and robustly identify errors in population-scale sequence data.

Results: SAMQA has been used on samples from three separate sets of cancer genome data from The Cancer Genome Atlas (TCGA) project. Using technical standards provided by the SAM specification and biological standards defined by researchers, we have classified errors in these sequence data sets relative to individual reads within a sample. Due to an observed linearithmic speedup through the use of a high-performance computing (HPC) framework for the majority of tasks, poor-quality data was identified prior to secondary analysis in significantly less time on the HPC framework than when the same data was run using alternative parallelization strategies on a single server.

Conclusions: The SAMQA toolset validates a minimum set of data quality standards across whole-genome and exome sequences. It is tuned to run on a high-performance computational framework, enabling QA across hundreds of gigabytes of samples regardless of coverage or sample type.

Highlights

Advances in high-throughput sequencing technologies and growth in data sizes have highlighted the need for scalable tools to perform quality assurance testing
The SAMQA toolkit was developed to support work being undertaken at the Center for Systems Analysis of the Cancer Regulome, which is one of The Cancer Genome Atlas (TCGA) Genome Data Analysis Centers
SAMQA is a QA analysis toolkit that runs a series of tests over sequenced read data and is optimized for large numbers of files

Summary

Results

The SAMQA toolkit was developed to support work being undertaken at the Center for Systems Analysis of the Cancer Regulome, which is one of the TCGA Genome Data Analysis Centers. In a recent QA run on COAD/READ (Colon/Rectal Adenocarcinoma) samples, the tool was used to analyze 324 exome and 42 full genome samples. The results of the technical tests are summarized, and the results of the biological tests are shown in Figures 3 and 4 (SAMQA output shown in additional file 1). The tool automatically rejected those samples that failed the technical tests (e.g. six samples that contained only unmapped reads). The tests themselves are output as a single file, which can be read directly into an analysis program. The supplementary materials contain the output of the default tests run across the COAD/READ samples as well as the Glioblastoma (GBM) and Ovarian (OV) cancer samples. The SAMQA toolset consists of eleven different technical and biologically relevant tests run over each BAM file.
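The rejection of samples that contain only unmapped reads can be illustrated with a small sketch. This is not SAMQA's implementation, which is not shown here. It assumes the reads are available as SAM text lines (for example, a BAM file converted with `samtools view -h`) and relies only on the SAM specification's FLAG field, where bit 0x4 marks an unmapped segment. The function names `count_reads` and `only_unmapped` are hypothetical.

```python
from typing import Iterable, Tuple

UNMAPPED_FLAG = 0x4  # SAM FLAG bit 0x4: segment is unmapped


def count_reads(sam_lines: Iterable[str]) -> Tuple[int, int]:
    """Return (total_reads, unmapped_reads) for SAM-format text lines."""
    total = 0
    unmapped = 0
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # FLAG is the second mandatory column
        total += 1
        if flag & UNMAPPED_FLAG:
            unmapped += 1
    return total, unmapped


def only_unmapped(sam_lines: Iterable[str]) -> bool:
    """True when a sample has at least one read and every read is unmapped."""
    total, unmapped = count_reads(sam_lines)
    return total > 0 and total == unmapped


# Hypothetical records, made up for illustration only.
bad_sample = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",
]
ok_sample = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",
]
print(only_unmapped(bad_sample), only_unmapped(ok_sample))
```

A check of this kind needs only one pass over the reads and no alignment-aware logic, which is why a purely technical test like it can reject a sample before any biological tests are attempted.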
