HiMMe: using genetic patterns as a proxy for genome assembly reliability assessment

Jordi Abante, Noushin Ghaffari, Charles D Johnson, Aniruddha Datta

doi:10.1186/s12864-017-3965-2


Open Access

https://doi.org/10.1186/s12864-017-3965-2


Abstract

Background: The information content of genomes plays a crucial role in the existence and proper development of living organisms. Thus, tremendous effort has been dedicated to developing DNA sequencing technologies that provide a better understanding of the underlying mechanisms of cellular processes. Advances in the development of sequencing technology have made it possible to sequence genomes in a relatively fast and inexpensive way. However, as with any measurement technology, there is noise involved and this needs to be addressed to reach conclusions based on the resulting data. In addition, there are multiple intermediate steps and degrees of freedom when constructing genome assemblies that lead to ambiguous and inconsistent results among assemblers.

Methods: Here we introduce HiMMe, an HMM-based tool that relies on genetic patterns to score genome assemblies. Through a Markov chain, the model is able to detect characteristic genetic patterns, while, by introducing emission probabilities, the noise involved in the process is taken into account. Prior knowledge can be used by training the model to fit a given organism or sequencing technology.

Results: Our results show that the method presented is able to recognize patterns even with relatively small k-mer size choices and limited computational resources.

Conclusions: Our methodology provides an individual quality metric per contig in addition to an overall genome assembly score, with a time complexity well below that of an aligner. Ultimately, HiMMe provides meaningful statistical insights that can be leveraged by researchers to better select contigs and genome assemblies for downstream analysis.

Highlights

When constructing genome assemblies there are multiple intermediate steps that can significantly impact the results obtained
We considered simulated data to show how the algorithm is able to recognize genetic patterns in the input data
We used our method to study the genome assemblies generated by the Genome Assembly Gold-Standard Evaluations (GAGE) project [14]

Summary

Introduction

When constructing genome assemblies, there are multiple intermediate steps that can significantly impact the results obtained. Even when using the same input data, an assembler will tend to produce different genome assemblies depending on its particular configuration. The same holds when comparing the output of different assemblers. These degrees of freedom lead to ambiguous and inconsistent results among assemblers.

This section formally defines Markov chains and hidden Markov models, providing the background material needed to develop our method. We denote the set of all possible sequences of states of length $m$ as $\mathcal{S}^m$.
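To make the hidden Markov model background concrete, the following is a minimal Python sketch of the standard forward algorithm, which computes the likelihood of an observed sequence by summing over every hidden state sequence of the same length without enumerating them. It is not the HiMMe implementation: the single-nucleotide state space, the parameter values in `INIT`, `TRANS` and `EMIT`, and the helper `per_base_score` are all hypothetical choices made for illustration only.

```python
import math

STATES = "ACGT"  # toy hidden states: the true nucleotide at each position


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def forward_log_likelihood(observed, init, trans, emit):
    """Return log P(observed) under the HMM, summed over all hidden paths.

    Assumes a non-empty sequence and strictly positive probabilities.
    """
    # Initialisation: start in state s and emit the first observed symbol.
    alpha = {s: math.log(init[s]) + math.log(emit[s][observed[0]]) for s in STATES}
    # Recursion: move from any state s to state t, then emit the next symbol.
    for symbol in observed[1:]:
        alpha = {
            t: log_sum_exp([alpha[s] + math.log(trans[s][t]) for s in STATES])
            + math.log(emit[t][symbol])
            for t in STATES
        }
    # Termination: sum over the final hidden state.
    return log_sum_exp(list(alpha.values()))


# Hypothetical toy parameters (assumptions, not trained values from the paper).
INIT = {s: 0.25 for s in STATES}
# Markov chain favouring the repeating pattern A -> C -> G -> T -> A.
TRANS = {
    s: {t: 0.7 if t == STATES[(i + 1) % 4] else 0.1 for t in STATES}
    for i, s in enumerate(STATES)
}
# Emission probabilities model measurement noise: 3% chance of a miscalled base.
EMIT = {s: {o: 0.97 if o == s else 0.01 for o in STATES} for s in STATES}


def per_base_score(contig):
    """Length-normalised log-likelihood, so contigs of different sizes compare."""
    return forward_log_likelihood(contig, INIT, TRANS, EMIT) / len(contig)


print(per_base_score("ACGTACGTACGT"))  # follows the pattern the chain favours
print(per_base_score("AAAAAAAAAAAA"))  # deviates from that pattern
```

Under these toy parameters, the sequence that follows the favoured pattern receives the higher per-base score. Normalising by length is one simple way to compare contigs of different sizes; the paper's own scoring may differ.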
