Abstract

Background: The information content of genomes plays a crucial role in the existence and proper development of living organisms. Thus, tremendous effort has been dedicated to developing DNA sequencing technologies that provide a better understanding of the underlying mechanisms of cellular processes. Advances in the development of sequencing technology have made it possible to sequence genomes in a relatively fast and inexpensive way. However, as with any measurement technology, there is noise involved and this needs to be addressed to reach conclusions based on the resulting data. In addition, there are multiple intermediate steps and degrees of freedom when constructing genome assemblies that lead to ambiguous and inconsistent results among assemblers.

Methods: Here we introduce HiMMe, an HMM-based tool that relies on genetic patterns to score genome assemblies. Through a Markov chain, the model is able to detect characteristic genetic patterns, while, by introducing emission probabilities, the noise involved in the process is taken into account. Prior knowledge can be used by training the model to fit a given organism or sequencing technology.

Results: Our results show that the method presented is able to recognize patterns even with relatively small k-mer size choices and limited computational resources.

Conclusions: Our methodology provides an individual quality metric per contig in addition to an overall genome assembly score, with a time complexity well below that of an aligner. Ultimately, HiMMe provides meaningful statistical insights that can be leveraged by researchers to better select contigs and genome assemblies for downstream analysis.

Highlights

  • When constructing genome assemblies there are multiple intermediate steps that can significantly impact the results obtained

  • We considered simulated data to show how the algorithm is able to recognize genetic patterns in the input data

  • We used our method to study the genome assemblies generated by the Genome Assembly Gold-Standard Evaluations (GAGE) project [14]

Summary

Introduction

When constructing genome assemblies there are multiple intermediate steps and degrees of freedom that can significantly impact the results obtained. Even when using the same input data, an assembler will tend to produce different genome assemblies depending on its particular configuration. The same holds when comparing the output of different assemblers, which leads to ambiguous and inconsistent results among them.

This section formally defines Markov chains and hidden Markov models, and provides the background material needed for developing our method. We denote the set of all possible sequences of states of length m as m.
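To make the background concrete, the following is a minimal sketch of how a hidden Markov model can assign a score to an observed sequence using the scaled forward algorithm. It is an illustration only and not HiMMe's actual model: here the hidden states are assumed to be single true nucleotides rather than k-mers, the emission model is assumed to be a uniform substitution error, and all names (`forward_log_likelihood`, `noisy_emission`, `score_contig`) and parameter values are hypothetical.

```python
import math

ALPHABET = "ACGT"  # assumed state and observation alphabet (illustrative)


def forward_log_likelihood(observed, initial, transition, emission):
    """Return log P(observed) under an HMM using the scaled forward algorithm.

    observed   : list of observation indices
    initial    : initial[i] = P(first hidden state is i)
    transition : transition[i][j] = P(next state is j | current state is i)
    emission   : emission[i][o] = P(observing o | hidden state is i)
    """
    if not observed:
        return 0.0
    n = len(initial)
    # alpha[i] is proportional to P(state i, observations up to now).
    alpha = [initial[i] * emission[i][observed[0]] for i in range(n)]
    log_lik = 0.0
    for t in range(len(observed)):
        if t > 0:
            alpha = [
                sum(alpha[i] * transition[i][j] for i in range(n))
                * emission[j][observed[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under the model
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # renormalise to avoid underflow
    return log_lik


def noisy_emission(error_rate):
    """Assumed noise model: a base is read correctly with probability
    1 - error_rate, otherwise replaced uniformly by one of the other bases."""
    n = len(ALPHABET)
    return [
        [1.0 - error_rate if i == j else error_rate / (n - 1) for j in range(n)]
        for i in range(n)
    ]


def score_contig(contig, initial, transition, error_rate=0.01):
    """Per-base log-likelihood of a contig (hypothetical scoring rule)."""
    observed = [ALPHABET.index(base) for base in contig]
    log_lik = forward_log_likelihood(
        observed, initial, transition, noisy_emission(error_rate)
    )
    return log_lik / len(observed)


# Toy parameters: a chain that strongly prefers the cycle A -> C -> G -> T -> A.
uniform_initial = [0.25] * 4
cyclic_transition = [
    [0.7 if j == (i + 1) % 4 else 0.1 for j in range(4)] for i in range(4)
]
```

Under these toy parameters, a contig that follows the pattern the chain was built to prefer (for example `score_contig("ACGTACGT", uniform_initial, cyclic_transition)`) receives a higher per-base score than one that does not (for example `"AAAAAAAA"`). In practice the transition and emission probabilities would be estimated from training data for a given organism or sequencing technology, as the abstract describes.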

