Abstract

Insertions and deletions (indels) are important types of structural variation. Accurate indel genotypes may facilitate further genetic study. A few methods exist for calling indel genotypes from sequence reads, but none of them can accurately genotype indels of all lengths, especially on low coverage sequence data. In this paper, we present GINDEL, an approach for calling genotypes of both insertions and deletions from sequence reads. GINDEL uses a machine learning approach that combines multiple features extracted from next generation sequencing data. We test our approach on both simulated and real data and compare it with existing tools, including Genome STRiP, Pindel and Clever-sv. Results show that GINDEL works well for deletions larger than 50 bp on both high and low coverage data. GINDEL also performs well for insertion genotyping on both simulated and real data. For comparison, Genome STRiP performs less well for shorter deletions (50–200 bp) on both simulated data and real sequence data from the 1000 Genomes Project. Clever-sv performs well for intermediate deletions (200–1500 bp) but is less accurate when coverage is low. Pindel works well only on high coverage data. To summarize, we show that GINDEL can call genotypes of both short and long insertions and deletions on high and low coverage population sequence data, and that it is more accurate and efficient than the other approaches. The program GINDEL can be downloaded at: http://sourceforge.net/p/gindel

Highlights

  • Structural variation (SV) is genetic variation in the structure of an organism’s genome

  • We focus on analyzing low coverage population sequence reads, such as those being generated by the 1000 Genomes Project

  • With the trained support vector machine (SVM), GINDEL can call the genotypes of deletions and insertions accurately and efficiently

Introduction

Structural variation (SV) is genetic variation in the structure of an organism’s genome. The ongoing 1000 Genomes Project [1, 2] has released structural variation calls for several human populations, based on hundreds of sequenced individuals [3]. Suppose a paired-end read (called a read pair) is mapped to the reference genome using a tool such as BWA [13]. Sometimes both reads of a pair map with the expected orientation and order, but the mapped insert size is discordant with the library insert size.
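
The notion of a discordant insert size can be illustrated with a minimal Python sketch. This is not GINDEL's implementation: the ReadPair record, the classify_pair function, and the cutoff of k = 3 standard deviations are all hypothetical choices made here for illustration. The sketch assumes the library insert size is summarized by a mean and a standard deviation, and it relies on the general observation that a pair mapping farther apart than expected is consistent with a deletion in the sample, while a pair mapping closer together is consistent with an insertion.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Hypothetical minimal record for a mapped read pair (not a GINDEL type)."""
    left_start: int           # leftmost reference position covered by the pair
    right_end: int            # rightmost reference position covered by the pair
    proper_orientation: bool  # True if both reads agree on orientation and order

    @property
    def insert_size(self) -> int:
        # Span of the pair on the reference genome.
        return self.right_end - self.left_start


def classify_pair(pair: ReadPair, lib_mean: float, lib_sd: float, k: float = 3.0) -> str:
    """Label a read pair by comparing its mapped span with the library insert size.

    The cutoff k (in standard deviations) is an assumed parameter, not a value
    taken from the paper.
    """
    if not pair.proper_orientation:
        # Orientation or order problems are outside the case described above.
        return "other"
    deviation = pair.insert_size - lib_mean
    if deviation > k * lib_sd:
        return "deletion_like"    # mapped span much larger than the library size
    if deviation < -k * lib_sd:
        return "insertion_like"   # mapped span much smaller than the library size
    return "concordant"


if __name__ == "__main__":
    # Assumed library: mean insert size 500 bp, standard deviation 50 bp.
    for pair in (ReadPair(1000, 1500, True), ReadPair(1000, 2300, True)):
        print(pair.insert_size, classify_pair(pair, lib_mean=500, lib_sd=50))
```

In practice a per-pair label like this would be only one signal among several; the abstract states that GINDEL combines multiple features, which a single threshold does not capture.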
