Abstract
Public sequencing databases are invaluable resources to biological researchers, but assessing data veracity, as well as curating and maintaining such large collections of data, can be challenging. Genomes of eukaryotic organelles, such as chloroplasts and other plastids, are particularly susceptible to assembly errors and misrepresentations in these databases due to their close evolutionary relationships with bacteria, which may co-occur within the same environment, as can be the case when sequencing plants. Here, based on sequence similarities with bacterial genomes, we identified several suspicious chloroplast assemblies present in the National Institutes of Health (NIH) Reference Sequence (RefSeq) collection. Investigations into these chloroplast assemblies reveal examples of erroneous integration of bacterial sequences into chloroplast ribosomal RNA (rRNA) loci, often within the rRNA genes, presumably due to the high similarity between plastid and bacterial rRNAs. The bacterial lineages identified within the examined chloroplasts as the most likely source of contamination are either known associates of plants or co-occur in the same environmental niches as the examined plants. Modifications to the methods used to process untargeted ‘raw’ shotgun sequencing data from whole genome sequencing efforts, such as the identification and removal of bacterial reads prior to plastome assembly, could eliminate similar errors in the future.
Highlights
Available sequence databases, such as those in the International Nucleotide Sequence Database Collaboration (INSDC), are fundamental resources for many types of bioinformatic analyses
The ribosomal RNA (rRNA) regions from several Reference Sequence (RefSeq) chloroplast genome assemblies were examined for potential bacterial sequence contamination
The raw sequencing data used to generate these chloroplast genome assemblies were screened for bacterial contamination, and the bacteria-filtered data were compared to the assemblies to assess the impact of the contamination
Summary
Available sequence databases, such as those in the International Nucleotide Sequence Database Collaboration (INSDC), are fundamental resources for many types of bioinformatic analyses. With the increased availability of sequencing and a wide array of methods designed for routine use by genomics and bioinformatic novices, there is a constant need to monitor sequence entries and to assess the quality and veracity of data within these public resources. Given that most bioinformatic analyses involve similarity searches against references in these databases, it is not uncommon to see annotation and genome errors transferred to other projects or analyses, making it imperative to quickly identify and correct any erroneous submissions within these trusted databases before the errors propagate. Nuclear genomes are almost always published along with the raw sequencing data used to generate the assembly, but based on our examinations, it appears to be quite uncommon to find links to raw data for organelle genomes in the RefSeq or GenBank databases. Evaluation of plastomes requires special considerations, given their unique evolutionary relationships with bacteria, which complicate the assessment of contamination.
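As an illustration of the screening step mentioned in the highlights (removing bacterial reads from raw data before plastome assembly), the following is a minimal Python sketch, not the authors' pipeline. It assumes that some upstream classifier has already produced a set of read IDs flagged as bacterial; the function names `parse_fastq` and `drop_flagged_reads` and the example reads are hypothetical, and the input is assumed to be well-formed four-line FASTQ with no blank lines.

```python
from typing import Iterable, Iterator, Set, Tuple

# A FASTQ record reduced to (read_id, sequence, quality).
FastqRecord = Tuple[str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from well-formed 4-line FASTQ text (assumption: no blank lines)."""
    it = iter(lines)
    # zip over the same iterator four times consumes the input in groups of four lines.
    for header, seq, _plus, qual in zip(it, it, it, it):
        read_id = header.rstrip("\n")[1:].split()[0]  # drop '@' and any description
        yield read_id, seq.rstrip("\n"), qual.rstrip("\n")


def drop_flagged_reads(
    records: Iterable[FastqRecord], flagged_ids: Set[str]
) -> Iterator[FastqRecord]:
    """Keep only reads whose ID was NOT flagged as bacterial by the upstream classifier."""
    for record in records:
        if record[0] not in flagged_ids:
            yield record


# Hypothetical example: two reads, one of which was flagged as bacterial.
example_fastq = [
    "@read1 plastid-like\n", "ACGTACGT\n", "+\n", "IIIIIIII\n",
    "@read2 bacterial-like\n", "GGCCGGCC\n", "+\n", "IIIIIIII\n",
]
flagged = {"read2"}  # assumed output of a separate classification step
kept = list(drop_flagged_reads(parse_fastq(example_fastq), flagged))
```

Here `kept` contains only the record for `read1`; the filtered reads would then be passed to the assembler in place of the raw data. Because plastid and bacterial rRNAs are highly similar, the choice of classifier and threshold for flagging reads matters and is deliberately left outside this sketch.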