Abstract

Reference-quality genomes are expected to provide a resource for studying gene structure, function, and evolution. However, often genes of interest are not completely or accurately assembled, leading to unknown errors in analyses or additional cloning efforts for the correct sequences. A promising solution is long-read sequencing. Here we tested PacBio-based long-read sequencing and diploid assembly for potential improvements to the Sanger-based intermediate-read zebra finch reference and Illumina-based short-read Anna's hummingbird reference, 2 vocal learning avian species widely studied in neuroscience and genomics. With DNA of the same individuals used to generate the reference genomes, we generated diploid assemblies with the FALCON-Unzip assembler, resulting in contigs with no gaps in the megabase range, representing 150-fold and 200-fold improvements over the current zebra finch and hummingbird references, respectively. These long-read and phased assemblies corrected and resolved what we discovered to be numerous misassemblies in the references, including missing sequences in gaps, erroneous sequences flanking gaps, base call errors in difficult-to-sequence regions, complex repeat structure errors, and allelic differences between the 2 haplotypes. These improvements were validated by single long-genome and transcriptome reads and resulted for the first time in completely resolved protein-coding genes widely studied in neuroscience and specialized in vocal learning species. These findings demonstrate the impact of long reads, sequencing of previously difficult-to-sequence regions, and phasing of haplotypes on generating the high-quality assemblies necessary for understanding gene structure, function, and evolution.

Highlights

  • Having available genomes of species of interest provides a powerful resource to rapidly conduct investigations on genes of interest

  • To generate long-read assemblies, high–molecular weight DNA was isolated from the muscle tissue of the same zebra finch male and Anna’s hummingbird female used to create the current reference genomes [2, 8]

  • The DNA was sheared, 35–40 kb libraries were generated, the DNA was size-selected for inserts >17 kb (Fig. S1), and SMRT sequencing was performed on the PacBio RS II instrument to obtain ∼96× coverage for the zebra finch (19-kb N50 read length) and ∼70× coverage for the hummingbird (22-kb N50 read length) (Fig. S2)
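
The coverage and N50 read-length figures quoted above are summary statistics over the set of read lengths. As a rough illustration of how such numbers are commonly computed, here is a minimal sketch; the function names `n50` and `coverage` and the toy read lengths are assumptions made for this example, not values or code from the study.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


def coverage(lengths, genome_size):
    """Return mean sequencing depth: total sequenced bases / genome size."""
    return sum(lengths) / genome_size


# Hypothetical toy data (in kb), chosen only to make the arithmetic easy to follow.
toy_reads_kb = [2, 3, 4, 5, 6]
print(n50(toy_reads_kb))           # N50 of the toy read set
print(coverage(toy_reads_kb, 10))  # depth over a hypothetical 10-kb genome
```

The same `n50` function applies to contig lengths, which is the usual basis for comparing the contiguity of two assemblies.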



Introduction

Having available genomes of species of interest provides a powerful resource to rapidly conduct investigations on genes of interest. The EGR1 immediate early gene transcription factor, a commonly studied gene in neuroscience and in vocal learning species, was missing its promoter, which lies in a GC-rich region, in every bird genome we examined (including the Sanger-based assemblies). Another immediate early gene, DUSP1, with specialized vocalizing-driven gene expression in song nuclei of vocal learning species, has microsatellite sequences in the promoters of vocal learning species that are missing or misassembled, requiring single-molecule cloning and sequencing to resolve [13]. Such errors demand considerable effort to clone, sequence, and correct the assemblies of individual genes of interest.
