Abstract

Detecting genetic variation is one of the main applications of high-throughput sequencing, but it remains challenging wherever the alignment of short reads is ambiguous. Current state-of-the-art variant calling approaches avoid such regions, arguing that it is necessary to sacrifice detection sensitivity to limit false discovery. We developed a method that links candidate variant positions within repetitive genomic regions into clusters. The technique relies on a resource, a thesaurus of genetic variation, that enumerates genomic regions with similar sequence. The resource is computationally intensive to generate, but once compiled it can be applied efficiently to annotate and prioritize variants in repetitive regions. We show that thesaurus annotation can reduce the rate of false variant calls due to mappability by up to three orders of magnitude. We apply the technique to whole genome datasets and establish that called variants in low mappability regions annotated using the thesaurus can be experimentally validated. We then extend the analysis to a large panel of exomes to show that the annotation technique opens possibilities to study variation in hitherto hidden and under-studied parts of the genome.
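
The linking of candidate positions into clusters can be illustrated with a minimal sketch. It is not the published implementation: the site format `(chrom, pos)`, the representation of thesaurus links as pairs of sites, and the function name `cluster_candidates` are all assumptions made for illustration. The sketch groups candidates with a union-find structure, so that any two candidates connected by a chain of links end up in the same cluster.

```python
from collections import defaultdict


def cluster_candidates(candidates, thesaurus_links):
    """Group candidate variant sites that are connected by thesaurus links.

    candidates: iterable of (chrom, pos) tuples (hypothetical format).
    thesaurus_links: iterable of (site_a, site_b) pairs marking positions in
        regions of similar sequence (hypothetical format).
    """
    # Every candidate starts as the root of its own cluster.
    parent = {site: site for site in candidates}

    def find(site):
        # Walk up to the cluster root, shortening the path along the way.
        while parent[site] != site:
            parent[site] = parent[parent[site]]
            site = parent[site]
        return site

    for a, b in thesaurus_links:
        # Assumption: a link only matters when both ends are candidates.
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for site in parent:
        clusters[find(site)].append(site)
    # Sort for a deterministic result.
    return sorted(sorted(members) for members in clusters.values())


# Toy data, invented for this example.
candidates = [("chr1", 1000), ("chr1", 5000), ("chr7", 42), ("chr2", 300)]
links = [
    (("chr1", 1000), ("chr1", 5000)),
    (("chr1", 5000), ("chr7", 42)),
    (("chr2", 300), ("chr9", 77)),  # other end is not a candidate
]
clusters = cluster_candidates(candidates, links)
print(clusters)
```

In this toy input the first three sites form one cluster through two chained links, while the site on chr2 stays alone because its only link points to a position that is not a candidate.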

Highlights

  • Detection of genetic variation is one of the main applications of high-throughput sequencing and several software solutions exist tailored to this task [1,2]

  • When we estimated B-allele frequencies (BAFs) using only the reads at the called positions in low mappability regions, we found they were often smaller than unity

  • We identified sites called in low mappability regions that were missed using an intermediate mappability threshold, i.e. sites that were not themselves called at intermediate mappability settings and that were not linked to sites called at intermediate mappability settings
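
The last two highlights can be made concrete with a short sketch under stated assumptions. The B-allele frequency is taken here as the fraction of reads at a site that support the alternate allele. The selection of sites missed at the intermediate threshold follows the wording of the highlight. The function names, the `(chrom, pos)` site format and the `links` dictionary are hypothetical and are not taken from the published software.

```python
def b_allele_frequency(alt_reads, total_reads):
    """Fraction of reads at a site that support the alternate (B) allele."""
    if total_reads == 0:
        return None  # no coverage, so the frequency is undefined
    return alt_reads / total_reads


def sites_missed_at_intermediate(low_calls, intermediate_calls, links):
    """Sites called at the low mappability threshold that were neither called
    at the intermediate threshold nor linked to a site called there.

    links: dict mapping a site to the set of sites it is linked to
        (hypothetical format).
    """
    intermediate = set(intermediate_calls)
    missed = []
    for site in low_calls:
        if site in intermediate:
            continue  # already called at the intermediate threshold
        if links.get(site, set()) & intermediate:
            continue  # linked to a site called at the intermediate threshold
        missed.append(site)
    return missed


# Toy data, invented for this example.
low_calls = [("chr1", 100), ("chr1", 200), ("chr3", 50)]
intermediate_calls = [("chr1", 100)]
links = {
    ("chr1", 200): {("chr1", 100)},
    ("chr3", 50): {("chr8", 9)},
}
missed = sites_missed_at_intermediate(low_calls, intermediate_calls, links)
baf = b_allele_frequency(6, 24)
print(missed, baf)
```

With these inputs only the chr3 site is reported, because the first site is called at both thresholds and the second is linked to it.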

Introduction

Detection of genetic variation is one of the main applications of high-throughput sequencing, and several software solutions exist that are tailored to this task [1,2]. These methods have already enabled breakthroughs in the understanding of cancers [3,4,5]. Efforts to limit the false discovery rate during variant calling have led bioinformatic methods to avoid analyzing regions of the human genome where alignment of short reads poses ambiguities. Despite the abundance of raw data, much genetic variation in such regions remains uncharacterized.
