SeqSQC: A Bioconductor Package for Evaluating the Sample Quality of Next-generation Sequencing Data

Qian Liu, Qiang Hu, Song Yao, Marilyn L Kwan, Janise M Roh, Hua Zhao, Christine B Ambrosone, Lawrence H Kushi, Song Liu, Qianqian Zhu

doi:10.1016/j.gpb.2018.07.006

Abstract

As next-generation sequencing (NGS) technology has become widely used to identify causal genetic variants for various diseases and traits, a number of packages for checking NGS data quality have sprung up in the public domain. In addition to the quality of the sequencing data, sample quality issues, such as gender mismatch, abnormal inbreeding coefficient, cryptic relatedness, and population outliers, can also have a fundamental impact on downstream analysis. However, there is a lack of tools specialized in identifying problematic samples from NGS data, often due to limitations in sample size and variant counts. We developed SeqSQC, a Bioconductor package, to automate and accelerate sample cleaning in NGS data of any scale. SeqSQC is designed for efficient data storage and access, and is equipped with interactive plots for intuitive data visualization to expedite the identification of problematic samples. SeqSQC is available at http://bioconductor.org/packages/SeqSQC.

Highlights

The past several years have seen an explosion of genetic and genomic studies utilizing next-generation sequencing (NGS) technology in basic sciences, translational research, and clinics [1,2,3,4,5,6,7].
Most currently available quality control (QC) tools for NGS data are designed for base/read-level QC, which typically involves assessing the intrinsic quality of the raw reads to diagnose artifacts that arise from the library preparation and sequencing run [9,10,11,12,13,14].
One strength of SeqSQC is that it combines a benchmark dataset generated from the 1000 Genomes Project with the study cohort during the QC process.

Summary

Introduction

The past several years have seen an explosion of genetic and genomic studies utilizing next-generation sequencing (NGS) technology in basic sciences, translational research, and clinics [1,2,3,4,5,6,7]. A successful NGS study relies in large part on rigorous quality control (QC) to ensure that artifacts are removed before data analysis, so that real signals are not masked by quality issues. Most currently available QC tools for NGS data are designed for base/read-level QC, which typically involves assessing the intrinsic quality of the raw reads to diagnose artifacts that arise from the library preparation and sequencing run [9,10,11,12,13,14]. NGSQC [9] can monitor base/color code across each tile/panel, as well as quality measures for paired-end/mate-pair libraries, whereas NGS QC Toolkit [10] is designed for homopolymer trimming and primer/adaptor contamination removal. QC-chain [14] is a tool for quality assessment and trimming of raw reads, as well as identification, quantification, and filtration of unknown contamination.
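To make "base/read-level QC" concrete, the following is a minimal Python sketch of one such check: decoding the per-base quality string of a read and flagging reads whose mean quality is low. It is an illustration only and does not reproduce how NGSQC, NGS QC Toolkit, or QC-chain work. The function names, the Phred+33 (Sanger) encoding, and the threshold of 20 are all assumptions chosen for the example.

```python
from typing import Iterable, List, Tuple


def phred33_scores(quality: str) -> List[int]:
    """Decode a quality string, assuming Phred+33 (Sanger) encoding."""
    return [ord(ch) - 33 for ch in quality]


def mean_quality(quality: str) -> float:
    """Mean per-base quality of one read; 0.0 for an empty string."""
    scores = phred33_scores(quality)
    return sum(scores) / len(scores) if scores else 0.0


def flag_low_quality(
    reads: Iterable[Tuple[str, str]], min_mean_q: float = 20.0
) -> List[str]:
    """Return IDs of reads whose mean quality is below min_mean_q.

    `reads` yields (read_id, quality_string) pairs; the default
    threshold of 20 is an illustrative assumption, not a recommendation.
    """
    return [rid for rid, qual in reads if mean_quality(qual) < min_mean_q]


# Hypothetical toy input: one high-quality and one low-quality read.
reads = [("read1", "IIIIIIII"), ("read2", "!!!!++++")]
flagged = flag_low_quality(reads)
```

Checks of this kind operate on individual reads and bases. They are separate from the sample-level issues (gender mismatch, relatedness, population outliers) that SeqSQC targets.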

Methods

Results

Conclusion