Real or fake? Measuring the impact of protein annotation errors on estimates of domain gain and loss events.

Arnaud Kress, Julie D Thompson, Odile Lecompte, Olivier Poch

doi:10.3389/fbinf.2023.1178926

Abstract

Protein annotation errors can have significant consequences in a wide range of fields, ranging from protein structure and function prediction to biomedical research, drug discovery, and biotechnology. By comparing the domains of different proteins, scientists can identify common domains, classify proteins based on their domain architecture, and highlight proteins that have evolved differently in one or more species or clades. However, genome-wide identification of different protein domain architectures involves a complex error-prone pipeline that includes genome sequencing, prediction of gene exon/intron structures, and inference of protein sequences and domain annotations. Here we developed an automated fact-checking approach to distinguish true domain loss/gain events from false events caused by errors that occur during the annotation process. Using genome-wide ortholog sets and taking advantage of the high-quality human and Saccharomyces cerevisiae genome annotations, we analyzed the domain gain and loss events in the predicted proteomes of 9 non-human primates (NHP) and 20 non-S. cerevisiae fungi (NSF) as annotated in the UniProt and InterPro databases. Our approach allowed us to quantify the impact of errors on estimates of protein domain gains and losses, and we show that domain losses are over-estimated ten-fold and three-fold in the NHP and NSF proteins respectively. This is in line with previous studies of gene-level losses, where issues with genome sequencing or gene annotation led to genes being falsely inferred as absent. In addition, we show that inconsistent protein domain annotations are a major factor contributing to the false events. For the first time, to our knowledge, we show that domain gains are also over-estimated by three-fold and two-fold respectively in NHP and NSF proteins.
Based on our more accurate estimates, we infer that true domain losses and gains in NHP with respect to humans are observed at similar rates, while domain gains in the more divergent NSF are observed twice as frequently as domain losses with respect to S. cerevisiae. This study highlights the need to critically examine the scientific validity of protein annotations, and represents a significant step toward scalable computational fact-checking methods that may one day mitigate the propagation of wrong information in protein databases.
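
To make the idea of domain gain and loss events concrete, the comparison between a reference protein and its ortholog can be sketched as a multiset difference over domain identifiers. This is only an illustrative sketch and not the authors' pipeline: the function names `domain_events` and `overestimation_fold`, the placeholder domain IDs, and the representation of an architecture as a plain list are all assumptions made for the example.

```python
from collections import Counter


def domain_events(reference, target):
    """Compare two domain architectures given as lists of domain IDs.

    Returns (lost, gained): domains present in the reference protein but
    missing from the target ortholog, and domains present only in the target.
    Multisets are used so that repeated domains are counted per copy.
    """
    ref, tgt = Counter(reference), Counter(target)
    lost = sorted((ref - tgt).elements())
    gained = sorted((tgt - ref).elements())
    return lost, gained


def overestimation_fold(apparent_events, confirmed_events):
    """Ratio of apparent events to events that remain after error filtering."""
    if confirmed_events == 0:
        raise ValueError("no confirmed events; fold is undefined")
    return apparent_events / confirmed_events


# Hypothetical toy data: placeholder domain IDs, not real annotations.
human = ["DOM_A", "DOM_B", "DOM_B", "DOM_C"]
primate_ortholog = ["DOM_A", "DOM_B", "DOM_D"]

lost, gained = domain_events(human, primate_ortholog)
print("apparent losses:", lost)
print("apparent gains:", gained)
```

An apparent loss or gain found this way is only a candidate event; in the study, such candidates are then checked against possible sequencing, gene prediction, and domain annotation errors before being counted as true events.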

Journal: Frontiers in Bioinformatics	Publication Date: Apr 20, 2023
Citations: 4	License type: CC BY 4.0
