Abstract
Next Generation Sequencing studies generate a large quantity of genetic data in a relatively cost- and time-efficient manner and provide an unprecedented opportunity to identify candidate causative variants that lead to disease phenotypes. A challenge to these studies is the generation of sequencing artifacts by current technologies. To identify and characterize the properties that distinguish false positive variants from true variants, we sequenced a child and both parents (one trio) using DNA isolated from three sources (blood, buccal cells, and saliva). The trio strategy allowed us to identify variants in the proband that could not have been inherited from the parents (Mendelian errors) and most likely indicate sequencing artifacts. Quality control measurements were examined, and three were found to identify the greatest number of Mendelian errors: read depth, genotype quality score, and alternate allele ratio. Applying these filters independently removed ~95% of the Mendelian errors while retaining 80% of the called variants. After filtering, the concordance between identical samples isolated from different sources was 99.99%, compared to 87% before filtering. This high concordance suggests that different sources of DNA can be used in trio studies without affecting the ability to identify causative polymorphisms. To facilitate analysis of next generation sequencing data, we developed the Cincinnati Analytical Suite for Sequencing Informatics (CASSI) to store sequencing files and metadata (e.g., relatedness information), manage file versioning, perform data filtering and variant annotation, and identify candidate causative polymorphisms that follow de novo, rare recessive homozygous, or compound heterozygous inheritance models. We conclude that the data cleaning process improves the signal-to-noise ratio in terms of variants and facilitates the identification of candidate disease causative polymorphisms.
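The three-filter strategy described above can be illustrated with a minimal sketch. The cutoff values below (minimum read depth, minimum genotype quality, and the acceptable alternate allele ratio window for heterozygous calls) are illustrative assumptions, not the thresholds reported by the study.

```python
# A minimal sketch of filtering a called variant on the three quality
# measurements named in the abstract. Thresholds here are assumptions
# chosen for illustration, not the paper's published cutoffs.

def passes_filters(depth, genotype_quality, alt_reads,
                   min_dp=10, min_gq=20, het_ab=(0.3, 0.7)):
    """Return True if a heterozygous call survives all three filters.

    depth            -- total reads covering the site (DP)
    genotype_quality -- phred-scaled genotype quality (GQ)
    alt_reads        -- reads supporting the alternate allele
    """
    if depth < min_dp:
        return False
    if genotype_quality < min_gq:
        return False
    # Alternate allele ratio: a true heterozygote should be near 0.5;
    # extreme ratios suggest a sequencing or alignment artifact.
    alt_ratio = alt_reads / depth
    return het_ab[0] <= alt_ratio <= het_ab[1]

# A well-covered, balanced heterozygote passes; a shallow site does not.
print(passes_filters(depth=25, genotype_quality=60, alt_reads=12))  # True
print(passes_filters(depth=6,  genotype_quality=60, alt_reads=3))   # False
```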
Highlights
Next-generation sequencing (NGS) has emerged as a powerful tool to investigate the genetic etiology of diseases
In developing informatics filters for NGS exome data, we aimed to retain the largest possible number of total variants while removing the largest possible number of Mendelian errors in the child
Given the high fidelity of DNA replication in humans (Schmitt et al, 2009; Korona et al, 2011), the vast majority of Mendelian errors in the unfiltered NGS data are due to sequencing error rather than de novo mutations, and they therefore provide a method of tracking the effect of filters on sequencing artifacts
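The Mendelian-error concept in the highlights reduces to a simple consistency test: a child's genotype must be composable from one allele inherited from each parent. The sketch below uses hypothetical diploid genotypes encoded as allele tuples.

```python
# A minimal sketch of the trio Mendelian-consistency check used to flag
# likely sequencing artifacts. Genotypes are diploid allele tuples,
# e.g., (0, 1) for a heterozygote; the example calls are hypothetical.

from itertools import product

def is_mendelian_error(father, mother, child):
    """True if the child's genotype cannot be formed from one allele
    inherited from each parent."""
    possible = {tuple(sorted(pair)) for pair in product(father, mother)}
    return tuple(sorted(child)) not in possible

# A 0/1 child is consistent with 0/0 x 0/1 parents...
print(is_mendelian_error((0, 0), (0, 1), (0, 1)))  # False
# ...but a 1/1 child is not: one allele must come from the 0/0 father.
print(is_mendelian_error((0, 0), (0, 1), (1, 1)))  # True
```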
Summary
Next-generation sequencing (NGS) has emerged as a powerful tool to investigate the genetic etiology of diseases. In NGS, a fastq file of millions of short DNA sequences is generated for each sample. These fastq files are aligned to the reference genome using one of many different alignment tools, producing SAM and BAM files, which are large files containing hundreds of millions of short sequences aligned to the reference genome. Because of the extraordinary quantity of data generated, even a low error rate produces a large number of sequencing artifacts that are likely to be called as variants. Variant callers such as the Genome Analysis Toolkit (GATK) are used to generate a list of the variants in the variant call format (VCF; McKenna et al, 2010). VCF files contain meta-information for each variant.
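To make the VCF structure concrete, the sketch below parses one tab-separated data line. The record itself is fabricated for illustration; GT (genotype), AD (allele depths), DP (read depth), and GQ (genotype quality) are standard FORMAT keys emitted by GATK-style callers, and they map directly onto the quality measurements used for filtering.

```python
# A minimal sketch of pulling per-sample quality metrics out of a single
# VCF data line. The record is fabricated; field names follow the VCF spec.

record = "chr1\t12345\t.\tA\tG\t734.2\tPASS\tDP=31\tGT:AD:DP:GQ\t0/1:17,14:31:99"

(chrom, pos, _id, ref, alt, qual,
 filt, info, fmt, sample) = record.split("\t")

# The FORMAT column declares the order of the per-sample values.
fields = dict(zip(fmt.split(":"), sample.split(":")))
ref_reads, alt_reads = map(int, fields["AD"].split(","))

print("site:", chrom, pos, ref, ">", alt)
print("genotype (GT):", fields["GT"])            # 0/1 (heterozygous)
print("read depth (DP):", fields["DP"])          # 31
print("genotype quality (GQ):", fields["GQ"])    # 99
print("alt allele ratio:", alt_reads / (ref_reads + alt_reads))
```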