Optimal reference sequence selection for genome assembly using minimum description length principle

Bilal Wajid, Hazem Nounou, Mohamed Nounou, Erchin Serpedin

doi:10.1186/1687-4153-2012-18

Abstract

Reference-assisted assembly uses a reference sequence as a model to assist in the assembly of a novel genome. The standard method for identifying the best reference sequence counts the number of reads that align to each candidate reference and chooses the candidate with the highest count. This article explores the use of the minimum description length (MDL) principle and two of its variants, the two-part MDL and the sophisticated MDL, in identifying the optimal reference sequence for genome assembly. The article compares the proposed MDL-based scheme with the standard method and concludes that “counting the number of reads of the novel genome present in the reference sequence” is not a sufficient condition. The proposed MDL scheme therefore incorporates the standard method of “counting the number of reads that align to the reference sequence” and additionally examines the model itself, the reference sequence, when identifying the optimal reference. The proposed MDL-based scheme not only provides a sufficient criterion for identifying the optimal reference sequence for genome assembly but also improves the reference sequence so that it becomes more suitable for the assembly of the novel genome.

Highlights

Rissanen’s minimum description length (MDL) is an inference tool that learns regular features in the data by data compression
MDL is relevant to genome assembly because genome assembly is itself an inference problem: the task is to infer the novel genome from the read data obtained from sequencing
The MDL process was analyzed on synthetic data using four different sets of mutations, varying the number and length of single nucleotide polymorphisms (SNPs), inversions, insertions, and deletions
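
The four mutation types named in the highlights can be illustrated with a minimal Python sketch. The function names, the toy sequence, and the choice to model an inversion as a plain reversal of the segment (rather than a reverse complement) are assumptions made for illustration; the source does not specify how its synthetic data were generated.

```python
import random

BASES = "ACGT"

def snp(seq, pos, rng):
    """Replace the base at `pos` with a different, randomly chosen base."""
    choices = [b for b in BASES if b != seq[pos]]
    return seq[:pos] + rng.choice(choices) + seq[pos + 1:]

def inversion(seq, start, length):
    """Reverse the segment seq[start:start+length] in place (assumed model)."""
    segment = seq[start:start + length]
    return seq[:start] + segment[::-1] + seq[start + length:]

def insertion(seq, pos, fragment):
    """Insert `fragment` immediately before position `pos`."""
    return seq[:pos] + fragment + seq[pos:]

def deletion(seq, start, length):
    """Remove `length` bases beginning at `start`."""
    return seq[:start] + seq[start + length:]

# Hypothetical toy genome; a fixed seed keeps the example repeatable.
rng = random.Random(0)
genome = "".join(rng.choice(BASES) for _ in range(60))

mutated = snp(genome, 5, rng)
mutated = inversion(mutated, 10, 8)
mutated = insertion(mutated, 30, "TTTT")
mutated = deletion(mutated, 45, 6)
print(genome)
print(mutated)
```

Varying the positions, fragment lengths, and number of calls gives mutation sets of different size, which is the kind of variation the highlight describes.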

Summary

Introduction

Rissanen’s minimum description length (MDL) is an inference tool that learns regular features in the data by data compression. MDL uses “code-length” as a measure to identify the best model among a set of models: the model that compresses the data the most, and therefore yields the smallest code-length, is considered the best. In this sense, the MDL principle seeks the simplest explanation (model) of the phenomenon (data). The MDL principle has been used successfully in inferring the structure of gene regulatory networks [6,7,8,9,10,11,12,13].
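
The idea of ranking models by code-length can be made concrete with a small two-part MDL sketch in Python. The encoding here is a toy assumption, not the scheme used in the article: the model (reference) costs 2 bits per base, and each read costs one flag bit plus either the bits needed to name its position in the reference (if it matches exactly) or 2 bits per base (if it does not). All function names and sequences are hypothetical.

```python
import math

def count_aligned(reads, reference):
    """Read-count criterion: number of reads found exactly in the reference."""
    return sum(1 for read in reads if read in reference)

def model_bits(reference):
    """L(M): cost of the reference under a uniform 2-bit code over A, C, G, T."""
    return 2 * len(reference)

def data_bits(reads, reference):
    """L(D|M): cost of the reads given the reference (toy encoding)."""
    total = 0.0
    for read in reads:
        total += 1  # flag bit: does the read match the reference or not
        if read in reference:
            # A matching read is described by its start position only.
            positions = len(reference) - len(read) + 1
            total += math.log2(positions)
        else:
            # A non-matching read is spelled out literally.
            total += 2 * len(read)
    return total

def two_part_code_length(reads, reference):
    """Two-part MDL score: L(M) + L(D|M); smaller is better."""
    return model_bits(reference) + data_bits(reads, reference)

reads = ["ACGTAC", "GTACGG", "ACGGTT"]
candidates = {
    "ref_a": "ACGTACGGTT",
    "ref_b": "ACGTACGGTT" + "C" * 20,  # same content plus unrelated sequence
}

for name, ref in candidates.items():
    print(name, count_aligned(reads, ref), round(two_part_code_length(reads, ref), 2))

best = min(candidates, key=lambda name: two_part_code_length(reads, candidates[name]))
print("MDL choice:", best)
```

In this toy setting every read aligns to both candidates, so the read count cannot separate them, while the code-length penalizes the longer reference for the sequence it carries that explains no reads. This mirrors the abstract's point that the model itself must be taken into account, though the article's actual code-length definitions may differ.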

Methods

Results

Discussion

Conclusion

Journal: EURASIP Journal on Bioinformatics and Systems Biology
Publication Date: Nov 27, 2012
Citations: 47
License type: CC BY 2.0
