Abstract
Although recent developments in DNA sequencing have allowed for great leaps in both the quality and quantity of genome assembly projects, de novo assemblies still lack the efficiency and accuracy required for studying genetic variation of individuals. Thus, efficient and accurate methods for calling and genotyping genetic variants are fundamental to studying the genomes of individuals. We study the problem of genotyping insertion variants. We assume that the location of the insertion is given, and the task is to find the insertion sequence. Insertions are the hardest structural variants to genotype, because the insertion sequence must be assembled from the reads, whereas genotyping other structural variants only requires transformations of the reference genome. The current methods for constructing insertion variants are mostly linked to variation calling methods and are only able to construct small insertions. A sub-problem in genome assembly, the gap filling problem, provides techniques that are readily applicable to insertion genotyping. Gap filling takes the context and length of a missing sequence in a genome assembly and attempts to assemble the intervening sequence. In this paper we show how tools and methods for gap filling can be used to assemble insertion variants by modeling the problem of insertion genotyping as filling gaps in the reference genome. We further give a general read filtering scheme to make the method scalable to large data sets. Our results show that gap filling methods are competitive with insertion genotyping tools. We further show that read filtering improves the performance of insertion genotyping, especially for long insertions. Our experiments show that on long insertions the newly proposed method is the most accurate one, whereas on short insertions its performance is comparable to that of existing tools.
Highlights
High-throughput sequencing is today part of the standard toolbox in life science research
The filtering is compared to a known truth by mapping the same reads to the reference genome without the gaps and taking all the reads that overlap with a given gap
We have shown how gap filling tools developed for de novo genome assembly can be applied to the insertion genotyping problem and how the performance of these tools can be improved using read filtering
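The comparison of the read filter against a known truth, described in the highlights above, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes the alignments of the reads to the gap-free reference are already available as half-open intervals, and the names `reads_overlapping_gap` and `precision_recall`, the read names, and all coordinates are hypothetical.

```python
from typing import Dict, Set, Tuple

# Half-open interval [start, end) on the gap-free reference (assumed convention).
Interval = Tuple[int, int]


def reads_overlapping_gap(alignments: Dict[str, Interval], gap: Interval) -> Set[str]:
    """Return the names of reads whose alignment overlaps the gap interval."""
    gap_start, gap_end = gap
    return {
        name
        for name, (start, end) in alignments.items()
        if start < gap_end and end > gap_start
    }


def precision_recall(filtered: Set[str], truth: Set[str]) -> Tuple[float, float]:
    """Compare the reads kept by a filter with the truth set for one gap."""
    true_positives = len(filtered & truth)
    precision = true_positives / len(filtered) if filtered else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical alignments of four reads to the reference without the gap.
alignments = {
    "r1": (90, 190),
    "r2": (150, 250),
    "r3": (300, 400),
    "r4": (195, 295),
}
gap = (180, 220)  # hypothetical position of the removed sequence

truth = reads_overlapping_gap(alignments, gap)
filtered = {"r1", "r2", "r3"}  # hypothetical output of a read filter
precision, recall = precision_recall(filtered, truth)
```

Precision and recall are only one plausible way to summarise such a comparison; the measure actually used in the paper may differ.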
Summary
High-throughput sequencing is today part of the standard toolbox in life science research. Despite the advances in sequencing technologies, fully constructing the genome of an individual, i.e. de novo genome assembly, is still a time-consuming task, especially for large eukaryotic genomes [1]. If a reference genome is available, as for the human genome, a resequencing approach is usually applied to determine the genetic variants in a donor genome as compared to the reference. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.