Abstract

Reference assisted assembly requires the use of a reference sequence, as a model, to assist in the assembly of the novel genome. The standard method for identifying the best reference sequence for the assembly of a novel genome aims at counting the number of reads that align to the reference sequence, and then choosing the reference sequence which has the highest number of reads aligning to it. This article explores the use of minimum description length (MDL) principle and its two variants, the two-part MDL and Sophisticated MDL, in identifying the optimal reference sequence for genome assembly. The article compares the MDL based proposed scheme with the standard method coming to the conclusion that “counting the number of reads of the novel genome present in the reference sequence” is not a sufficient condition. Therefore, the proposed MDL scheme includes within itself the standard method of “counting the number of reads that align to the reference sequence” and also moves forward towards looking at the model, the reference sequence, as well, in identifying the optimal reference sequence. The proposed MDL based scheme not only becomes the sufficient criterion for identifying the optimal reference sequence for genome assembly but also improves the reference sequence so that it becomes more suitable for the assembly of the novel genome.

Highlights

  • Rissanen’s minimum description length (MDL) is an inference tool that learns regular features in the data by data compression

  • The relevance of MDL to Genome assembly can be realized by understanding that Genome assembly is an inference problem where the task at hand is to infer the novel genome from read data obtained from sequencing

  • The MDL process was analyzed on synthetic data on four different sets of mutations by varying the number and length of {Single nucleotide polymorphisms (SNPs), Inversions, Insertions, and Deletions}

Read more

Summary

Introduction

Rissanen’s minimum description length (MDL) is an inference tool that learns regular features in the data by data compression. MDL uses “code-length” as a measure to identify the best model amongst a set of models. The model which compresses the data the most and presents the smallest code-length is considered the best model. MDL principle tries to find the simplest explanation (model) to the phenomenon (data). The MDL principle has been used successfully in inferring the structure of gene regulatory networks [6,7,8,9,10,11,12,13], The article is organized as follows.

Methods
Results
Discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call