Abstract
Background: Genome assemblers to date have predominantly targeted haploid reference reconstruction from homozygous data. When applied to diploid genome assembly, these assemblers perform poorly, owing to the violation of assumptions during both the contigging and scaffolding phases. Effective tools to overcome these problems are in growing demand. Increasing parameter stringency during contigging is an effective solution to obtaining haplotype-specific contigs; however, effective algorithms for scaffolding such contigs are lacking.
Methods: We present a stand-alone scaffolding algorithm, ScaffoldScaffolder, designed specifically for scaffolding diploid genomes. The algorithm identifies homologous sequences as found in "bubble" structures in scaffold graphs. Machine learning classification is used to then classify sequences in partial bubbles as homologous or non-homologous sequences prior to reconstructing haplotype-specific scaffolds. We define four new metrics for assessing diploid scaffolding accuracy: contig sequencing depth, contig homogeneity, phase group homogeneity, and heterogeneity between phase groups.
Results: We demonstrate the viability of using bubbles to identify heterozygous homologous contigs, which we term homolotigs. We show that machine learning classification trained on these homolotig pairs can be used effectively for identifying homologous sequences elsewhere in the data with high precision (assuming error-free reads).
Conclusion: More work is required to comparatively analyze this approach on real data with various parameters and classifiers against other diploid genome assembly methods. However, the initial results of ScaffoldScaffolder supply validity to the idea of employing machine learning in the difficult task of diploid genome assembly. Software is available at http://bioresearch.byu.edu/scaffoldscaffolder.
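The "bubble" structure mentioned in the Methods can be illustrated with a minimal sketch. This is not ScaffoldScaffolder's implementation: it assumes a scaffold graph reduced to plain directed edges between contig names, ignores contig orientation and read-pair link weights, and only recognizes the simplest case, where two single-contig branches leave one contig and rejoin at another. The function name `find_simple_bubbles` and the contig labels are hypothetical.

```python
from collections import defaultdict


def find_simple_bubbles(edges):
    """Return (source, branch_a, branch_b, sink) tuples for simple bubbles.

    Assumption: a simple bubble is a contig with exactly two successors,
    each reachable only from that contig, and both leading to the same
    single successor.
    """
    succ = defaultdict(set)
    pred = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)

    bubbles = []
    # sorted() copies the keys, so later defaultdict lookups are safe.
    for source in sorted(succ):
        branches = sorted(succ[source])
        if len(branches) != 2:
            continue
        a, b = branches
        # Each branch must be entered only from the source contig.
        if pred[a] != {source} or pred[b] != {source}:
            continue
        # Both branches must converge on the same single sink contig.
        if len(succ[a]) != 1 or succ[a] != succ[b]:
            continue
        (sink,) = succ[a]
        bubbles.append((source, a, b, sink))
    return bubbles


# Hypothetical scaffold graph: c2a and c2b are candidate homologous contigs.
example_edges = [
    ("c1", "c2a"),
    ("c1", "c2b"),
    ("c2a", "c3"),
    ("c2b", "c3"),
    ("c3", "c4"),
]
print(find_simple_bubbles(example_edges))
```

In this toy graph the two branch contigs of the bubble would be the candidate pair that the abstract calls homolotigs; a partial bubble, where only one end of the structure is present, would not be reported by this sketch and is the case the abstract assigns to the classifier.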
Highlights
Genome assemblers to date have predominantly targeted haploid reference reconstruction from homozygous data
Initial de novo assembly algorithms were designed to essentially ignore any variation that may have existed between haplotypes
We present ScaffoldScaffolder, a diploid genome assembly approach which includes a newly developed scaffolding module to resolve haplotype-specific scaffolds
Summary
Genome assemblers to date have predominantly targeted haploid reference reconstruction from homozygous data. A genome contains all of the genetic information needed for an organism to live and represents a trove of data for those seeking to understand the complex mechanisms governing life. Proper analysis of these data presupposes the correctness of the reconstructed genomic sequence, which continues to motivate the need for assembly algorithms that produce more complete and more correct assemblies from next-generation sequence data. Genome assemblers have traditionally been designed to assemble haploid genomes [2,3,4]. This was motivated first by the vast array of monoploid bacterial genomes being sequenced and later by the ease with which the two haplotypes of many diploid species could be made homozygous enough (via inbreeding) to approximate a monoploid specimen. Initial de novo assembly algorithms were designed to essentially ignore any variation that may have existed between haplotypes.