Heterozygous genome assembly via binary classification of homologous sequence.

Paul M Bodily, Cameron Ortega, Jared C Price, M Stanley Fujimoto, Quinn Snell, Nozomu Okuda, Mark J Clement

doi:10.1186/1471-2105-16-s7-s5

Abstract

Background: Genome assemblers to date have predominantly targeted haploid reference reconstruction from homozygous data. When applied to diploid genome assembly, these assemblers perform poorly because their assumptions are violated during both the contigging and scaffolding phases. Effective tools to overcome these problems are in growing demand. Increasing parameter stringency during contigging is an effective way to obtain haplotype-specific contigs; however, effective algorithms for scaffolding such contigs are lacking.

Methods: We present a stand-alone scaffolding algorithm, ScaffoldScaffolder, designed specifically for scaffolding diploid genomes. The algorithm identifies homologous sequences as found in "bubble" structures in scaffold graphs. Machine learning is then used to classify sequences in partial bubbles as homologous or non-homologous before haplotype-specific scaffolds are reconstructed. We define four new metrics for assessing diploid scaffolding accuracy: contig sequencing depth, contig homogeneity, phase group homogeneity, and heterogeneity between phase groups.

Results: We demonstrate the viability of using bubbles to identify heterozygous homologous contigs, which we term homolotigs. We show that a machine learning classifier trained on these homolotig pairs can identify homologous sequences elsewhere in the data with high precision (assuming error-free reads).

Conclusion: More work is required to compare this approach, on real data and with various parameters and classifiers, against other diploid genome assembly methods. However, the initial results of ScaffoldScaffolder support the idea of employing machine learning in the difficult task of diploid genome assembly. Software is available at http://bioresearch.byu.edu/scaffoldscaffolder.

Highlights

Genome assemblers to date have predominantly targeted haploid reference reconstruction from homozygous data.
Initial de novo assembly algorithms were designed to essentially ignore any variation that may have existed between haplotypes.
We present ScaffoldScaffolder, a diploid genome assembly approach that includes a newly developed scaffolding module to resolve haplotype-specific scaffolds.

Summary

Introduction

Genome assemblers to date have predominantly targeted haploid reference reconstruction from homozygous data. A genome contains all of the genetic information needed for an organism to live and represents a trove of data for those seeking to understand the complex mechanisms governing all life. Proper analysis of these data presupposes the correctness of the reconstructed genomic sequence, which continues to motivate the need for assembly algorithms that produce more complete and more correct assemblies from next-generation sequence data. Genome assemblers have traditionally been designed to assemble haploid genomes [2,3,4]. This was motivated first by the vast array of monoploid bacterial genomes being sequenced, and later by the ease with which the two haplotypes of many diploid species could be made homogeneous or homozygous enough (via inbreeding) to closely approximate a monoploid specimen. Initial de novo assembly algorithms were designed to essentially ignore any variation that may have existed between haplotypes.

Methods
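
The abstract describes identifying homologous sequences found in "bubble" structures in scaffold graphs. As a rough illustration only, and not ScaffoldScaffolder's actual implementation, the Python sketch below finds the simplest kind of bubble: a contig that links to exactly two alternative contigs, both of which rejoin at the same downstream contig. It assumes the scaffold graph is a plain directed graph given as (from_contig, to_contig) pairs, and it ignores the contig orientation, link weights, and distance estimates that a real scaffold graph would carry. The function name find_simple_bubbles and all contig names are hypothetical.

```python
from collections import defaultdict


def find_simple_bubbles(edges):
    """Return (source, branch_a, branch_b, sink) tuples for simple bubbles.

    A simple bubble here is a source contig with exactly two successors,
    each reachable only from the source, that both lead to one shared sink.
    The two branch contigs are candidate homologous contigs.
    """
    succ = defaultdict(set)  # contig -> contigs it links to
    pred = defaultdict(set)  # contig -> contigs that link to it
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)

    bubbles = []
    for source in sorted(succ):  # sorted copy, so output order is stable
        branches = sorted(succ[source])
        if len(branches) != 2:
            continue
        a, b = branches
        # Each branch must be entered only from the source contig.
        if pred[a] != {source} or pred[b] != {source}:
            continue
        # Both branches must exit to the same single sink contig.
        if len(succ[a]) != 1 or succ[a] != succ[b]:
            continue
        (sink,) = succ[a]
        bubbles.append((source, a, b, sink))
    return bubbles


# Hypothetical example: c2a and c2b are alternative paths between c1 and c3.
example_edges = [
    ("c1", "c2a"),
    ("c1", "c2b"),
    ("c2a", "c3"),
    ("c2b", "c3"),
    ("c3", "c4"),
]
example_bubbles = find_simple_bubbles(example_edges)
```

In this toy graph the only bubble is (c1, c2a, c2b, c3), so c2a and c2b would be the candidate homologous pair. A branch pair that leaves the same source but never rejoins is not reported by this sketch; the abstract describes handling such partial bubbles with a trained classifier instead.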

Results

Conclusion