Abstract
Only a small fraction of genes deposited in databases have been experimentally characterised. The majority of proteins have their function assigned automatically, which can result in erroneous annotations. The reliability of current annotations in public databases is largely unknown, and experimental attempts to validate annotation accuracy within individual enzyme classes are lacking. In this study we surveyed functional annotations in the BRENDA enzyme database. We first applied a high-throughput experimental platform to verify functional annotations within one enzyme class, the S-2-hydroxyacid oxidases (EC 1.1.3.15). We chose 122 representative sequences of the class and screened them for their predicted function. Based on the experimental results, predicted domain architecture and similarity to previously characterised S-2-hydroxyacid oxidases, we inferred that at least 78% of sequences in the enzyme class are misannotated. We experimentally confirmed four alternative activities among the misannotated sequences and showed that misannotation in the enzyme class increased over time. Finally, we performed a computational analysis of annotations to all enzyme classes in the BRENDA database and showed that nearly 18% of all sequences are annotated to an enzyme class while sharing neither similarity nor domain architecture with experimentally characterised representatives. We showed that even well-studied enzyme classes of industrial relevance are affected by the problem of functional misannotation.
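The computational criterion mentioned in the abstract (a sequence annotated to an enzyme class while sharing neither similarity nor domain architecture with characterised representatives) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `AnnotatedSequence` record, the 25% identity cut-off, the precomputed `best_identity` value and the domain names `domain_A` and `domain_B` are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedSequence:
    """One sequence annotated to an enzyme class (hypothetical record)."""
    seq_id: str
    # Highest percent identity to any experimentally characterised member
    # of the class; assumed to be precomputed by an alignment tool.
    best_identity: float
    # Predicted domain architecture, modelled here as an unordered set.
    domains: frozenset


def is_suspect(seq, characterised_architectures, min_identity=25.0):
    """Flag a sequence that has neither sufficient similarity nor a shared
    domain architecture with characterised representatives.

    The 25% default is an illustrative assumption, not a value from the study.
    """
    similar = seq.best_identity >= min_identity
    shares_architecture = seq.domains in characterised_architectures
    return not similar and not shares_architecture


def suspect_fraction(seqs, characterised_architectures, min_identity=25.0):
    """Fraction of sequences flagged by is_suspect (0.0 for an empty input)."""
    if not seqs:
        return 0.0
    flagged = sum(
        is_suspect(s, characterised_architectures, min_identity) for s in seqs
    )
    return flagged / len(seqs)


if __name__ == "__main__":
    # Toy data with placeholder domain names.
    characterised = {frozenset({"domain_A"})}
    sequences = [
        AnnotatedSequence("s1", 60.0, frozenset({"domain_A"})),
        AnnotatedSequence("s2", 10.0, frozenset({"domain_A"})),
        AnnotatedSequence("s3", 10.0, frozenset({"domain_B"})),
        AnnotatedSequence("s4", 40.0, frozenset({"domain_B"})),
    ]
    print(suspect_fraction(sequences, characterised))
```

In this toy input only `s3` fails both checks, so one sequence in four is flagged. A sequence that passes either check is left unflagged, which makes the criterion conservative; it identifies candidates for misannotation rather than proving it.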
Highlights
With the steady increase of genetic information deposited to public databases, the proportion of experimentally characterised sequences continues to decline
Correct annotation of genomes is crucial for our understanding and utilization of functional gene diversity, yet the reliability of current protein annotations in public databases is largely unknown
We showed that misannotation is widespread across enzyme classes, affecting even well-studied classes of industrial relevance
Summary
With the steady increase of genetic information deposited to public databases, the proportion of experimentally characterised sequences continues to decline. As the traditional experimental methods for determining protein function cannot keep up with the increase in genomic data, high-throughput methods enabling protein family-wide substrate profiling for hundreds of enzymes are being implemented. Data generated in such approaches are important for understanding sequence-function relationships in the tested protein families; they have led to the discovery of novel enzymatic activities and to the identification of enzymes with diverse physicochemical properties [2,3,4,5,6]. Several global initiatives have been undertaken to bring together computational and experimental scientists to accelerate discovery of novel protein activities and enable more trustworthy functional annotations [7,8,9].