Abstract

Background: In most flowering plants, the plastid genome exhibits a quadripartite genome structure, comprising a large and a small single copy as well as two inverted repeat regions. Thousands of plastid genomes have been sequenced and submitted to public sequence repositories in recent years. The quality of sequence annotations in many of these submissions is known to be problematic, especially regarding annotations that specify the length and location of the inverted repeats: such annotations are either missing or portray the length or location of the repeats incorrectly. However, many biological investigations employ publicly available plastid genomes at face value and implicitly assume the correctness of their sequence annotations.

Results: We introduce airpg, a Python package that automatically assesses the frequency of incomplete or incorrect annotations of the inverted repeats among publicly available plastid genomes. Specifically, the tool automatically retrieves plastid genomes from NCBI Nucleotide under variable search parameters, surveys them for length and location specifications of inverted repeats, and confirms any inverted repeat annotations through self-comparisons of the genome sequences. The package also includes functionality for automatic identification and removal of duplicate genome records and accounts for taxa that genuinely lack inverted repeats. A survey of the presence of inverted repeat annotations among all plastid genomes of flowering plants submitted to NCBI Nucleotide until the end of 2020 using airpg, followed by a statistical analysis of potential associations with record metadata, highlights that release year and publication status of the genome records have a significant effect on the frequency of complete and equal-length inverted repeat annotations.

Conclusion: The number of plastid genomes on NCBI Nucleotide has increased dramatically in recent years, and many more genomes will likely be submitted over the next decade. airpg enables researchers to automatically access and evaluate the inverted repeats of these plastid genomes as well as their sequence annotations and, thus, contributes to increasing the reliability of publicly available plastid genomes. The software is freely available via the Python package index at http://pypi.python.org/pypi/airpg.

Highlights

  • In most flowering plants, the plastid genome exhibits a quadripartite genome structure, comprising a large and a small single copy as well as two inverted repeat regions

  • The plastid genome presented by Dempewolf et al. [11], for example, exhibits nucleotide polymorphisms between its unannotated inverted repeats (IRs) and represents one of the many cases in which plastid genomes with either non-identical IRs or incomplete IR annotations were submitted to public sequence databases without highlighting the observed differences [12]

  • The number of plastid genomes deposited in National Center for Biotechnology Information (NCBI) Nucleotide has increased dramatically in recent years, and thousands of additional plastid genomes will likely be submitted over the next decade


Summary

Introduction

The typical plastid genome of flowering plants exhibits a quadripartite structure, comprising a large single-copy (LSC) and a small single-copy (SSC) region, separated by two identical inverted repeats (IRs) [2]. The software OGDraw [13], for example, employs exact string matching when determining the location of the IRs during the plotting of complete plastid genomes and dismisses sequence regions that contain nucleotide polymorphisms from consideration as possible IRs. The software Chloroplot [12] likewise operates under the assumption of IR equality in plastid genomes and explicitly highlights the differences between IRs that are found to be non-identical. Plastid genomes stored in public sequence repositories should contain complete and correct annotations regarding IR length and location [15].
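The notion of IR equality discussed above can be illustrated with a minimal Python sketch. It assumes that the two annotated IR regions are given as 0-based, end-exclusive coordinates and that one IR should equal the reverse complement of the other. The function names `reverse_complement` and `compare_irs` are hypothetical and are not taken from airpg, OGDraw, or Chloroplot; the sketch only shows the kind of comparison involved, not how any of these tools implement it.

```python
from __future__ import annotations

# Hypothetical helpers for illustration only; not the API of airpg,
# OGDraw, or Chloroplot.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def compare_irs(genome: str, ira: tuple[int, int], irb: tuple[int, int]) -> dict:
    """Compare two annotated IR regions of a genome sequence.

    `ira` and `irb` are assumed to be 0-based, end-exclusive (start, end)
    coordinates. One IR is compared against the reverse complement of the
    other, position by position.
    """
    a = genome[ira[0]:ira[1]].upper()
    b = reverse_complement(genome[irb[0]:irb[1]]).upper()
    equal_length = len(a) == len(b)
    # Mismatches are only counted when both regions have the same length;
    # unequal lengths would require an alignment, which is out of scope here.
    mismatches = sum(x != y for x, y in zip(a, b)) if equal_length else None
    return {
        "equal_length": equal_length,
        "identical": equal_length and mismatches == 0,
        "mismatches": mismatches,
    }
```

A tool that relies on exact string matching would accept only regions for which `identical` is true, whereas a tool that highlights differences would additionally report the mismatching positions. Handling IRs of unequal length would require a sequence alignment rather than the position-wise comparison shown here.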

