Approximate string matching for high-throughput sequencing

Enrico Siragusa

doi:10.17169/refubium-15562

Abstract

Over the past years, high-throughput sequencing (HTS) has become an invaluable method of investigation in molecular and medical biology. HTS technologies make it possible to sequence an individual’s DNA sample cheaply and rapidly, in the form of billions of short DNA reads. The ability to assess the content of a DNA sample at base-level resolution opens the way to a myriad of applications, including individual genotyping and assessment of large structural variations, measurement of gene expression levels and characterization of epigenetic features. Nonetheless, the quantity and quality of data produced by HTS instruments call for computationally efficient and accurate analysis methods. In this thesis, I present novel methods for the mapping of high-throughput sequencing DNA reads, based on state-of-the-art approximate string matching algorithms and data structures. Read mapping is a fundamental step of any HTS data analysis pipeline in resequencing projects, where DNA reads are reassembled by aligning them back to a previously known reference genome. The ingenuity of approximate string matching methods is crucial to the design of efficient and accurate read mapping tools. In the first part of this thesis, I cover practical indexing and filtering methods for exact and approximate string matching. I present state-of-the-art algorithms and data structures, give their pseudocode and discuss their implementation. Furthermore, I provide all implementations within SeqAn, the generic C++ template library for sequence analysis, which is freely available at http://www.seqan.de/. Subsequently, I experimentally evaluate all implemented methods, with the aim of guiding the engineering of new sequence alignment software. To the best of my knowledge, this is the first study providing a comprehensive exposition, implementation and evaluation of such methods. In the second part of this thesis, I turn to the engineering and evaluation of read mapping tools.
First, I present a novel method to find all mapping locations per read within a user-defined error rate; this method is published in the peer-reviewed journal Nucleic Acids Research and packaged in an open-source tool nicknamed Masai. Afterwards, I generalize this method to quickly report all co-optimal or suboptimal mapping locations per read within a user-defined error rate; this method, packaged in a tool called Yara, provides a more practical, yet sound solution to the read mapping problem. Extensive evaluations, both on simulated and real datasets, show that Yara has better speed and accuracy than de facto standard read mapping tools.
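
To make the notion of mapping "within a user-defined error rate" concrete, the following is a minimal sketch of the underlying approximate string matching problem. It uses the textbook dynamic programming recurrence for semi-global edit distance, not the indexing and filtering methods of the thesis, and it is not how Masai or Yara work internally. The function name `approximate_end_positions`, its parameters, and the rounding of the error rate to a whole number of errors are assumptions made for illustration only.

```python
def approximate_end_positions(reference, read, error_rate):
    """Report (end_position, edit_distance) for every position of `reference`
    where `read` matches a substring ending there within the error budget.

    Assumption for this sketch: the budget is floor(error_rate * len(read)).
    Runs in O(len(reference) * len(read)) time, so it is only a baseline.
    """
    max_errors = int(error_rate * len(read))
    m = len(read)
    # column[i] holds the minimum edit distance between read[:i] and any
    # substring of the reference ending at the current reference position.
    # column[0] stays 0 because a match may start anywhere in the reference.
    column = list(range(m + 1))
    hits = []
    for j, ref_char in enumerate(reference):
        previous_diagonal = column[0]
        for i in range(1, m + 1):
            substitution = previous_diagonal + (read[i - 1] != ref_char)
            previous_diagonal = column[i]
            column[i] = min(substitution,        # match or mismatch
                            column[i] + 1,       # extra reference character
                            column[i - 1] + 1)   # extra read character
        if column[m] <= max_errors:
            hits.append((j + 1, column[m]))      # 1-based end position
    return hits


if __name__ == "__main__":
    # Toy data, not taken from the thesis.
    print(approximate_end_positions("GATTACA", "TTAC", 0.25))
```

With an error rate of 0 the function reduces to exact matching. A practical read mapper avoids scanning the whole reference for each read, which is what the indexing and filtering methods mentioned in the abstract are for.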

