Multiple Cases of Bacterial Sequence Erroneously Incorporated Into Publicly Available Chloroplast Genomes.

Aaron J. Robinson, Erick S. LeBrun, Julia M. Kelliher, Patrick S. G. Chain, Hajnalka E. Daligault

doi:10.3389/fgene.2021.821715


Open Access


Journal: Frontiers in Genetics
Publication Date: Jan 13, 2022
Citations: 5
License type: CC BY 4.0

Affiliation: Los Alamos National Laboratory

Abstract

Public sequencing databases are invaluable resources to biological researchers, but assessing data veracity as well as the curation and maintenance of such large collections of data can be challenging. Genomes of eukaryotic organelles, such as chloroplasts and other plastids, are particularly susceptible to assembly errors and misrepresentations in these databases due to their close evolutionary relationships with bacteria, which may co-occur within the same environment, as can be the case when sequencing plants. Here, based on sequence similarities with bacterial genomes, we identified several suspicious chloroplast assemblies present in the National Institutes of Health (NIH) Reference Sequence (RefSeq) collection. Investigations into these chloroplast assemblies reveal examples of erroneous integration of bacterial sequences into chloroplast ribosomal RNA (rRNA) loci, often within the rRNA genes, presumably due to the high similarity between plastid and bacterial rRNAs. The bacterial lineages identified within the examined chloroplasts as the most likely source of contamination are either known associates of plants, or co-occur in the same environmental niches as the examined plants. Modifications to the methods used to process untargeted ‘raw’ shotgun sequencing data from whole genome sequencing efforts, such as the identification and removal of bacterial reads prior to plastome assembly, could eliminate similar errors in the future.

Highlights

Available sequence databases, such as those in the International Nucleotide Sequence Database Collaboration (INSDC), are fundamental resources for many types of bioinformatic analyses
The ribosomal RNA (rRNA) regions from several Reference Sequence (RefSeq) chloroplast genome assemblies were examined for potential bacterial sequence contamination
The raw sequencing data used to generate these chloroplast genome assemblies were screened for bacterial contamination, and the bacteria-filtered data were compared with the assemblies to assess the impact of the contamination
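The screening step described in the last highlight can be illustrated with a toy sketch. This is not the authors' pipeline: the function names, the k-mer comparison rule, and the short sequences below are all hypothetical and chosen only to show the idea of setting aside reads that resemble a bacterial reference more than a plastid reference before assembly. A real workflow would use dedicated read classifiers or aligners and full-length references.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def filter_bacterial_reads(reads, plastid_ref, bacterial_ref, k=8):
    """Split reads into (kept, removed).

    Assumed rule (illustrative only): a read is removed when it shares
    strictly more k-mers with the bacterial reference than with the
    plastid reference. Ties are kept.
    """
    plastid_kmers = kmers(plastid_ref, k)
    bacterial_kmers = kmers(bacterial_ref, k)
    kept, removed = [], []
    for read in reads:
        read_kmers = kmers(read, k)
        plastid_hits = len(read_kmers & plastid_kmers)
        bacterial_hits = len(read_kmers & bacterial_kmers)
        if bacterial_hits > plastid_hits:
            removed.append(read)
        else:
            kept.append(read)
    return kept, removed


# Made-up toy sequences, NOT real rRNA genes.
PLASTID_REF = "ATGGCGTACGTTAGCCTAGGCTAACGTTAGC"
BACTERIAL_REF = "TTGACCGGTATCGGATCCGATTACGGCATGA"

# Simulated reads: two drawn from the plastid reference, one from the bacterial one.
reads = [PLASTID_REF[2:22], BACTERIAL_REF[5:25], PLASTID_REF[8:28]]
kept, removed = filter_bacterial_reads(reads, PLASTID_REF, BACTERIAL_REF)
```

With real data the two references are highly similar in the rRNA region, which is exactly why this kind of screening is hard; a simple hit-count rule like the one above would need a much more careful threshold.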

Summary

Introduction

Available sequence databases, such as those in the International Nucleotide Sequence Database Collaboration (INSDC), are fundamental resources for many types of bioinformatic analyses. With the increased availability of sequencing and a wide array of methods designed for routine use by genomics and bioinformatics novices, there is a constant need to monitor sequence entries and to assess the quality and veracity of data within these public resources. Because most bioinformatic analyses involve similarity searches against references in these databases, it is not uncommon for annotation and genome errors to be transferred to other projects or analyses, making it imperative to identify and correct erroneous submissions within these trusted databases quickly, before the errors propagate. Nuclear genomes are almost always published along with the raw sequencing data used to generate the assembly, but based on our examinations, links to raw data appear to be quite uncommon for organelle genomes in the RefSeq or GenBank databases. Evaluating plastomes requires special consideration, given their unique evolutionary relationships with bacteria, which complicate the assessment of contamination.

