Abstract

Background and Objectives: Reference-assisted assembly uses a reference sequence as a model to assist in the assembly of a novel genome. The standard method for identifying the best reference sequence counts the number of reads from the novel genome that align to each candidate reference and chooses the reference with the highest count. This work explores the use of the minimum description length (MDL) principle and its variants in identifying the optimal reference sequence for genome assembly. Methods: MDL is relevant to genome assembly because assembly is an inference problem: the task is to infer the novel genome from the read data obtained from sequencing. MDL seeks the model that best describes the data; within the comparative assembly framework, this means finding the reference sequence that best describes the set of reads. This work explores the potential of three variants of MDL, namely two-part MDL, sophisticated MDL and minimax regret, for selecting the optimal reference sequence for comparative assembly. Results: The proposed scheme, based on sophisticated MDL, is shown to work for four sets of mutations: SNPs, insertions, inversions and deletions. It chooses the reference sequence that has the smaller number of SNPs, insertions and deletions, and it is able to detect all inversions and rectify them. Conclusions: The work compared the MDL scheme with the standard method of counting the number of reads that align to the reference sequence and found that, although a high aligned-read count is a necessary condition for the optimal reference sequence, it is not a sufficient one. The proposed MDL scheme therefore encompasses the standard method by defining it in inverted form, as counting the number of reads that do not align to the reference sequence.
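
The contrast between read counting and a description-length criterion can be illustrated with a toy sketch. It is not the scheme proposed in this work: the exact-substring notion of alignment, the 2-bits-per-base cost, the position cost, and the names `aligns`, `count_aligned` and `toy_description_length` are all assumptions made for illustration. The sketch only shows how two candidate references can tie on aligned-read count while a two-part code length of the form L(model) + L(data | model) still separates them.

```python
import math

def aligns(read, reference):
    # Assumption: a read "aligns" only if it is an exact substring.
    # Real aligners tolerate mismatches and gaps.
    return read in reference

def count_aligned(reads, reference):
    # Standard criterion: number of reads that align to the reference.
    return sum(aligns(r, reference) for r in reads)

def toy_description_length(reads, reference):
    """Toy two-part code length in bits: L(model) + L(data | model)."""
    model_bits = 2 * len(reference)        # 2 bits per base of the reference
    pos_bits = math.log2(len(reference))   # cost of a pointer into the reference
    data_bits = 0.0
    for r in reads:
        if aligns(r, reference):
            data_bits += pos_bits          # aligned read: encode its position only
        else:
            data_bits += 2 * len(r)        # unaligned read: spell it out literally
    return model_bits + data_bits

# Hypothetical data: both references contain the same three reads.
reads = ["ACGT", "CGTA", "GTAC", "TTTT"]
references = {
    "ref_short": "ACGTAC",
    "ref_long": "ACGTAC" + "G" * 20,
}

for name, ref in references.items():
    print(name, count_aligned(reads, ref), round(toy_description_length(reads, ref), 2))

best_by_mdl = min(references, key=lambda n: toy_description_length(reads, references[n]))
print("selected by toy description length:", best_by_mdl)
```

In this toy setting, unaligned reads are the expensive part of L(data | model), which mirrors the inverted view of read counting described in the conclusions; the model term is what breaks the tie between the two references.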
