ForestQC: Quality control on genetic variants from next-generation sequencing data using random forest.

Jiajin Li, Sungoo Hwang, Nelson B Freimer, Brandon Jew, Jae Hoon Sul, Lingyu Zhan, Giovanni Coppola, Mihaela Pertea

doi:10.1371/journal.pcbi.1007556


Open Access



Abstract

Next-generation sequencing (NGS) technology enables the discovery of nearly all genetic variants present in a genome. A subset of these variants, however, may have poor sequencing quality due to limitations in NGS or variant callers. In genetic studies that analyze a large number of sequenced individuals, it is critical to detect and remove poor-quality variants, as they may cause spurious findings. In this paper, we present ForestQC, a statistical tool for performing quality control on variants identified from NGS data by combining a traditional filtering approach and a machine learning approach. Our software uses information on sequencing quality, such as sequencing depth, genotyping quality, and GC content, to predict whether a particular variant is likely to be a false positive. To evaluate ForestQC, we applied it to two whole-genome sequencing datasets: one consists of related individuals from families, and the other consists of unrelated individuals. Results indicate that ForestQC outperforms widely used methods for performing quality control on variants, such as VQSR of GATK, by considerably improving the quality of variants to be included in the analysis. ForestQC is also very efficient and hence can be applied to large sequencing datasets. We conclude that combining the filtering approach with a machine learning algorithm trained on sequencing quality information is a practical way to perform quality control on genetic variants from sequencing data.

Highlights

Over the past few years, genome-wide association studies (GWAS) have been playing an essential role in identifying genetic variations associated with diseases or complex traits [1,2].
We show that ForestQC outperforms Variant Quality Score Recalibration (VQSR) and a filtering approach based on allele balance of heterozygous calls (ABHet) as high-quality variants detected from ForestQC have higher sequencing quality than those from VQSR and the filtering approach in both datasets.
These statistics consist of ABHet, Hardy-Weinberg Equilibrium (HWE) p-value, genotype missing rate, Mendelian error rate for family-based datasets, and any user-defined statistics.
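
To make the per-variant statistics named above more concrete, the following is a minimal Python sketch of how three of them could be computed for a single variant. It is an illustration under stated assumptions, not ForestQC's implementation: the function names, the genotype encoding (`None` for a missing call), and the input layout are hypothetical. ABHet follows the common definition of reference reads divided by total reads over heterozygous calls, and the HWE p-value uses a one-degree-of-freedom chi-square approximation, which may differ from the test the tool actually applies.

```python
from math import erfc, sqrt


def abhet(ref_depths, alt_depths):
    """Allele balance over heterozygous calls: REF reads / (REF + ALT reads).

    Assumes the inputs hold read depths for heterozygous samples only.
    Returns None when there are no reads at all.
    """
    ref = sum(ref_depths)
    total = ref + sum(alt_depths)
    return ref / total if total else None


def missing_rate(genotypes):
    """Fraction of samples whose genotype call is missing (encoded as None)."""
    return sum(g is None for g in genotypes) / len(genotypes)


def hwe_chi2_pvalue(n_hom_ref, n_het, n_hom_alt):
    """Chi-square (1 d.f.) approximation to a Hardy-Weinberg test p-value.

    This is a stand-in for illustration; an exact test may be preferable
    for rare variants.
    """
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        return None
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_hom_ref, n_het, n_hom_alt)
    if min(expected) == 0:
        # Monomorphic site: no deviation from equilibrium can be measured.
        return 1.0
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 d.f.
    return erfc(sqrt(chi2 / 2.0))
```

For example, `abhet([10, 12], [10, 8])` pools 22 reference reads out of 40 total, and `missing_rate([0, 1, None, 2])` counts one missing call among four samples. A well-behaved heterozygous site would be expected to have an ABHet near 0.5; any thresholds applied to these values are a separate choice that this sketch does not make.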

Summary

Introduction

Over the past few years, genome-wide association studies (GWAS) have played an essential role in identifying genetic variations associated with diseases or complex traits [1,2]. GWAS have found many associations between common variants and human diseases, such as schizophrenia [3], type 2 diabetes [4,5], and Parkinson's disease [6]. These common variants typically explain only a small fraction of the heritability of complex traits [7,8]. Whole-genome sequencing (WGS) has been used to identify rare variants associated with prostate cancer [14], and whole-exome sequencing studies have detected rare variants associated with LDL cholesterol [15] and autism [16].

Methods

Results

Discussion

Conclusion