Abstract
Background: There is considerable interest in the development of methods to efficiently identify all coding variants present in large sample sets of humans. Three approaches are possible: whole-genome sequencing, whole-exome sequencing using exon capture methods, and RNA-Seq. While whole-genome sequencing is the most complete, it remains sufficiently expensive that cost-effective alternatives are important.
Results: Here we provide a systematic exploration of how well RNA-Seq can identify human coding variants by comparing variants identified through high-coverage whole-genome sequencing to those identified by high-coverage RNA-Seq in the same individual. This comparison allowed us to directly evaluate the sensitivity and specificity of RNA-Seq in identifying coding variants, and to evaluate how key parameters such as the degree of coverage and the expression levels of genes interact to influence performance. We find that although only 40% of exonic variants identified by whole-genome sequencing were captured using RNA-Seq, this number rose to 81% when concentrating on genes known to be well-expressed in the source tissue. We also find that a high false positive rate can be problematic when working with RNA-Seq data, especially at higher levels of coverage.
Conclusions: We conclude that as long as a tissue relevant to the trait under study is available and suitable quality control screens are implemented, RNA-Seq is a fast and inexpensive alternative approach for finding coding variants in genes with sufficiently high expression levels.
Highlights
There is considerable interest in the development of methods to efficiently identify all coding variants present in large sample sets of humans.
By comparing the single nucleotide variants (SNVs) identified in the transcriptome at different levels of coverage to those identified in the genomic DNA (gDNA), we are able to directly evaluate how well RNA-Seq captures genomic variants. Alignment and coverage: both DNA and RNA were extracted from peripheral blood mononuclear cells (PBMCs) from the same individual.
We evaluated how the absolute number of true positive SNVs called depends on the amount of sequence data, in lanes, for all exons and for exons from PBMC-expressed genes (Figure 3).
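The comparison described in these highlights can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it assumes each call set has already been reduced to a set of (chromosome, position, reference allele, alternate allele) tuples, treats the gDNA calls as the truth set, and uses a hypothetical helper name, `compare_calls`, and made-up toy variants. It reports sensitivity and precision; specificity in the strict sense would also require the number of non-variant sites, which this sketch does not model.

```python
from typing import NamedTuple, Set, Tuple

# A variant is keyed by (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


class Comparison(NamedTuple):
    true_positives: int   # called in RNA-Seq and present in gDNA
    false_positives: int  # called in RNA-Seq but absent from gDNA
    false_negatives: int  # present in gDNA but missed by RNA-Seq
    sensitivity: float    # TP / (TP + FN)
    precision: float      # TP / (TP + FP)


def compare_calls(gdna_calls: Set[Variant], rna_calls: Set[Variant]) -> Comparison:
    """Score RNA-Seq calls against gDNA calls, treating gDNA as the truth set."""
    tp = len(gdna_calls & rna_calls)
    fp = len(rna_calls - gdna_calls)
    fn = len(gdna_calls - rna_calls)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return Comparison(tp, fp, fn, sensitivity, precision)


# Toy data, invented purely for illustration.
toy_gdna = {
    ("chr1", 1000, "A", "G"),
    ("chr1", 2000, "C", "T"),
    ("chr2", 1500, "G", "A"),
    ("chr2", 3000, "T", "C"),
    ("chr3", 500, "A", "C"),
}
toy_rna = {
    ("chr1", 1000, "A", "G"),
    ("chr2", 1500, "G", "A"),
    ("chr4", 700, "C", "G"),  # not in the gDNA set, so counted as a false positive
}

result = compare_calls(toy_gdna, toy_rna)
print(result)
```

Restricting both sets to variants that fall in exons of expressed genes before calling `compare_calls` would give the per-subset figures the highlights refer to; how that restriction is done depends on the annotation used and is left out here.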
Summary
There is considerable interest in the development of methods to efficiently identify all coding variants present in large sample sets of humans. The most comprehensive approach for focusing on exons alone is clearly exome capture, where regions matching a defined set of coding exons are pulled from the genomic DNA (gDNA) using microarrays and sequenced. This approach requires an initial and costly hybridization step. The cost of exome sequencing has contributed to the interest in sequencing the transcriptome (RNA-Seq) as an alternative, and possibly easier and less expensive, strategy [2]. While this approach will clearly miss genes that are poorly expressed in whatever tissue is being studied, it does have the advantage of generating additional information, such as gene expression levels and splicing patterns.