Abstract

Anyone who has attempted to identify the gene or mutation responsible for a disease or mutant phenotype will know the importance of an accurate reference genome assembly. For complex vertebrate genomes, however, generating such an assembly is not trivial, even with new sequencing technologies. In 2007, at a mid-point in the zebrafish genome sequencing project, I was asked to lead the project to completion. At that point we were faced with a highly fragmented physical assembly and lacked genetic maps of sufficient density and resolution to produce an accurate assembly. A high-quality reference genome assembly is generally made up of a large set of minimally overlapping large-insert genomic clones, each of which has been sequenced to completion, with a minimal number of gaps and with no artificially duplicated regions. These high-quality reference genome assemblies, such as the current human reference genome (http://www.genomereference.org), are essential for modern molecular genetic studies. For many species, however, only lower-quality whole-genome shotgun assemblies are available. When one considers, for example, only the protein-coding genes, this quality of genome sequence is often not sufficient to determine the complete gene count or a comprehensive set of accurate gene models. It is important, for the best application of the genomic information, that the reference genome be complete and accurately assembled. While high-throughput short-read sequencing using the current generation of machines will yield high-quality assemblies for bacterial and other small genomes, it is not possible to completely and accurately assemble the large, complex genomes of vertebrates without other long-range contiguity information. Experience with the zebrafish genome [1] may provide some useful guidance for anyone embarking on a genome-sequencing project for a new species with a complex genome.
The human, mouse and zebrafish reference genomes were assembled using old-school approaches, where the long-range contiguity was derived from genetic or genomic mapping and not derived directly from sequencing reads or read-pairs. The maps used were accurate physical maps of overlapping genomic DNA fragments or high-resolution genetic maps with a high density of short sequence markers, but such maps are expensive and time-consuming to generate. There are some good possibilities for cheaper, easier ways to generate accurate maps, but there are several issues to consider.
