The Challenge of Genome Sequence Assembly

Andrew Collins

doi:10.2174/1875036201811010231

Abstract

Background: Although whole genome sequencing is enabling numerous advances in many fields, achieving complete chromosome-level sequence assemblies for diverse species presents difficulties. The problems in part reflect the limitations of current sequencing technologies. Chromosome assembly from ‘short read’ sequence data is confounded by the presence of repetitive genome regions with numerous similar sequence tracts which cannot be accurately positioned in the assembled sequence. Longer sequence reads often have higher error rates and may still be too short to span the larger gaps between contigs.

Objective: Given the emergence of exciting new applications using sequencing technology, such as the Earth BioGenome Project, it is necessary to further develop and apply a range of strategies to achieve robust chromosome-level sequence assembly. Reviewed here is a range of methods to enhance assembly, which include the use of cross-species synteny to understand relationships between sequence contigs, the development of independent genetic and/or physical scaffold maps as frameworks for assembly (for example, radiation hybrid, optical motif and chromatin interaction maps) and the use of patterns of linkage disequilibrium to help position, orient and locate contigs.

Results and Conclusion: A range of methods exists which might be further developed to facilitate cost-effective large-scale sequence assembly for diverse species. A combination of strategies is required to best assemble sequence data into chromosome-level assemblies. There are a number of routes towards the development of maps which span chromosomes (including physical, genetic and linkage disequilibrium maps), and construction of these whole chromosome maps greatly facilitates the ordering and orientation of sequence contigs.
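To make concrete how a whole-chromosome map can help order and orient contigs, the following is a minimal sketch in Python. It is not a method taken from the review. It assumes each contig carries markers with a known offset within the contig and a known position on an independent map (genetic, physical or linkage disequilibrium). The function name `place_contigs`, the contig names and all marker values are hypothetical. Contigs are ordered by the median map position of their markers, and orientation is inferred by comparing the map positions of the first and last marker along the contig.

```python
from statistics import median


def place_contigs(markers):
    """Order and orient contigs against an independent chromosome map.

    markers: dict mapping contig name -> list of (offset_in_contig, map_position).
    Returns a list of (contig, orientation) sorted along the map, where
    orientation is "+", "-" or "?" (too few informative markers).
    """
    placed = []
    for contig, hits in markers.items():
        hits = sorted(hits)  # sort markers by their offset within the contig
        positions = [pos for _, pos in hits]
        if len(hits) < 2 or positions[0] == positions[-1]:
            orientation = "?"  # a single marker cannot orient a contig
        elif positions[-1] > positions[0]:
            orientation = "+"  # map position increases along the contig
        else:
            orientation = "-"  # map position decreases: contig is reversed
        placed.append((median(positions), contig, orientation))
    placed.sort()  # order contigs by median map position
    return [(contig, orientation) for _, contig, orientation in placed]


# Hypothetical example data: offsets in bp, map positions in arbitrary map units.
example_markers = {
    "contig_A": [(1_000, 42.0), (90_000, 35.5)],
    "contig_B": [(5_000, 3.1), (60_000, 8.4), (120_000, 12.9)],
    "contig_C": [(20_000, 57.0)],
}

if __name__ == "__main__":
    for contig, orientation in place_contigs(example_markers):
        print(contig, orientation)
```

Comparing only the first and last marker is a deliberate simplification. A practical pipeline would need to tolerate map errors and conflicting markers, for example by using a rank correlation across all markers on a contig.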

Highlights

The construction of whole chromosome assemblies from sequencing data is recognised as one of the most challenging problems in modern genomics [1].
The Earth BioGenome Project (EBP) [2] aims to construct high quality reference genomes for a member of each eukaryotic family to guide the assembly of lower quality, but informative, sequence builds for other species in the same family.
Because assembly algorithms are focussed on developing longer contigs, incorrect joins and orientations are a possibility [9]. Sequence polishing needs to anticipate this, and revision using alternative sources of information has the potential to resolve these errors.

Summary

Introduction

The construction of whole chromosome assemblies from sequencing data is recognised as one of the most challenging problems in modern genomics [1].

The three main goals of the Earth BioGenome Project (EBP) [2] focus, firstly, on understanding evolutionary relationships between organisms and obtaining insights into ecosystem composition and genome evolution, and on acquiring knowledge to accelerate the discovery of new species. The genomes of only ~2500 eukaryotic species have been completely or partly sequenced, so only a tiny fraction of the predicted 10-15 million eukaryotic species is currently available for analysis at the genome level. The EBP aims to start with the 1.5 million named eukaryotes, with an initial target of obtaining quality reference sequences for a single member of each of the 9000 or so eukaryote families.

Although whole genome sequencing is enabling numerous advances in many fields, achieving complete chromosome-level sequence assemblies for diverse species presents difficulties. Longer sequence reads often have higher error rates and may still be too short to span the larger gaps between contigs.
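The effect of read length can be illustrated with a toy example. The sketch below is an idealised model, not an assembler. It assumes error-free reads of a fixed length that cover every position, and it uses invented sequences (`REPEAT`, `A`, `B`, `C`, `D`) and hypothetical helper names. Two different genomes are built by swapping the unique segments that lie between three copies of the same repeat. Unless a read spans the whole repeat plus at least one flanking base on each side, both genomes yield exactly the same collection of reads, so read data alone cannot tell which arrangement is correct.

```python
from collections import Counter


def read_spectrum(genome: str, read_len: int) -> Counter:
    """Count every error-free read of length read_len (all substrings)."""
    return Counter(
        genome[i:i + read_len] for i in range(len(genome) - read_len + 1)
    )


# Hypothetical toy sequences, chosen only for illustration.
REPEAT = "TTAGGGTTAGGG"  # a 12-base repeat present in three copies
A, B, C, D = "ACGTAC", "GATCCA", "CGTGAT", "AGCTTG"  # unique segments

# Same content, but the unique segments B and C are swapped between repeats.
genome_1 = A + REPEAT + B + REPEAT + C + REPEAT + D
genome_2 = A + REPEAT + C + REPEAT + B + REPEAT + D


def distinguishable(read_len: int) -> bool:
    """True if reads of this length differ between the two arrangements."""
    return read_spectrum(genome_1, read_len) != read_spectrum(genome_2, read_len)


if __name__ == "__main__":
    for read_len in (8, len(REPEAT) + 1, len(REPEAT) + 2):
        print(f"read length {read_len}: distinguishable = {distinguishable(read_len)}")
```

In this idealised setting the two arrangements only become distinguishable once reads are two bases longer than the repeat. Real data add sequencing errors and uneven coverage, which is one reason independent maps are needed to position such regions.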

Results

Discussion

Conclusion