Abstract

The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life poses fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences that contain repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we review the problems associated with tandem repeat sequences that originate at different stages of the sequencing-assembly-annotation-deposition workflow and that may proliferate in public database repositories, affecting all downstream analyses. As a case study, we provide examples from the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise awareness within the community of database users, and to alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others.

Highlights

  • The availability of DNA and protein sequence data has revolutionized the way we study cellular, molecular, physiological, evolutionary and developmental processes, allowing the association of phenotypes with genotypes at a single nucleotide resolution

  • There is no easy solution to this issue, and the key purpose of this article is to raise awareness of the problem, especially amongst end-users of genome and protein databases, but likewise amongst researchers working on sequencing, assembly and annotation projects, who are often not fully aware of the biological importance of the repeat regions that they mis-sequence, mask, or remove

  • We strongly encourage the use of long-read sequencing technologies to better capture the tandem repeats at the sequencing and assembly stages


Introduction

The availability of DNA and protein sequence data has revolutionized the way we study cellular, molecular, physiological, evolutionary and developmental processes, allowing the association of phenotypes with genotypes at a single nucleotide (or single amino acid) resolution. Researchers rely on public sequence depositories and other databases, such as GenBank or UniProt, for sharing their data, and the content of these databases has grown exponentially in the last decades. While such databases initially consisted predominantly of submissions of individual gene or protein sequences that were carefully curated, large proportions of the content of genome and protein databases today originate from different types of metagenome and genome sequencing and assembly projects. For an informed use of such data, it is essential that end users understand the distinct contrast in quality between individual, well-curated submissions and entries generated from automated sequence annotation pipelines. The latter procedures can contain unrecognized errors
