Abstract

Background: Annotations that describe the function of sequences are enormously important to researchers during laboratory investigations and when making computational inferences. However, there has been little investigation into the data quality of sequence function annotations. Here we have developed a new method of estimating the error rate of curated sequence annotations, and applied this to the Gene Ontology (GO) sequence database (GOSeqLite). This method involved artificially adding errors to sequence annotations at known rates, and used regression to model the impact on the precision of annotations based on BLAST matched sequences.

Results: We estimated the error rate of curated GO sequence annotations in the GOSeqLite database (March 2006) at between 28% and 30%. Annotations made without use of sequence similarity based methods (non-ISS) had an estimated error rate of between 13% and 18%. Annotations made with the use of sequence similarity methodology (ISS) had an estimated error rate of 49%.

Conclusion: While the overall error rate is reasonably low, it would be prudent to treat all ISS annotations with caution. Electronic annotators that use ISS annotations as the basis of predictions are likely to have higher false prediction rates, and for this reason designers of these systems should consider avoiding ISS annotations where possible. Electronic annotators that use ISS annotations to make predictions should be viewed sceptically. We recommend that curators thoroughly review ISS annotations before accepting them as valid. Overall, users of curated sequence annotations from the GO database should feel assured that they are using a comparatively high quality source of information.

Highlights

  • Annotations that describe the function of sequences are enormously important to researchers during laboratory investigations and when making computational inferences

  • We have developed a method to undertake systematic analysis of Gene Ontology (GO) term annotation error in sequence annotation databases, and used this to estimate the GO term annotation error rate of the GOSeqLite sequence annotation database

  • We found that the overall error rate is 28%–30%, and that GO term annotations not based on sequence similarity have a far lower error rate than those that are, with error rates of 13%–18% and 49% respectively


Summary

Introduction

Annotations that describe the function of sequences are enormously important to researchers during laboratory investigations and when making computational inferences. There has been little investigation into the data quality of sequence function annotations. We have developed a new method of estimating the error rate of curated sequence annotations, and applied this to the Gene Ontology (GO) sequence database (GOSeqLite). While using expert curators to assign functions to sequences might be considered the least error-prone approach, this option is far slower than annotation by automated software. Artamonova et al. (2005) [3] examined the error rate of UniProt/SwissProt database annotations, consisting of five distinct types of annotation entries, and found an error rate of between 33% and 43%. As this database is widely considered to have a very high standard of curation, we might infer that other sequence databases have at least this annotation error rate, if not higher.
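The abstract describes the method only in outline: errors are added to annotations at known rates, and a regression models how precision against matched sequences responds. The Python sketch below is a toy illustration of that general idea, not the authors' procedure. Everything in it is an assumption made for illustration: the simulated database of matched pairs, the simulated base error rate of 0.25, the use of label agreement between matched sequences as a stand-in for BLAST-based precision, and the modelling choice that the square root of that agreement falls linearly with the added error rate.

```python
import math
import random


def wrong_term(term, n_terms, rng):
    """Return a random term id that differs from `term`."""
    other = rng.randrange(n_terms - 1)
    return other if other < term else other + 1


def make_database(n_pairs, base_error, n_terms, rng):
    """Simulate pairs of matched sequences that share one true term.
    Each stored label is wrong with probability `base_error` (hypothetical)."""
    pairs = []
    for _ in range(n_pairs):
        truth = rng.randrange(n_terms)
        labels = tuple(
            wrong_term(truth, n_terms, rng) if rng.random() < base_error else truth
            for _ in range(2)
        )
        pairs.append(labels)
    return pairs


def precision_after_corruption(pairs, added_rate, n_terms, rng):
    """Corrupt each label with probability `added_rate`, then return the
    fraction of pairs whose two labels still agree (toy 'precision')."""
    agree = 0
    for a, b in pairs:
        if rng.random() < added_rate:
            a = wrong_term(a, n_terms, rng)
        if rng.random() < added_rate:
            b = wrong_term(b, n_terms, rng)
        agree += a == b
    return agree / len(pairs)


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def estimate_base_error(pairs, added_rates, n_terms, rng):
    """Toy assumption: sqrt(precision) ~ (1 - e0) * (1 - r), so the fitted
    intercept at added rate r = 0 estimates 1 - e0."""
    ys = [math.sqrt(precision_after_corruption(pairs, r, n_terms, rng))
          for r in added_rates]
    intercept, _slope = fit_line(added_rates, ys)
    return 1.0 - intercept


rng = random.Random(0)
# The true base error rate (0.25) is a simulation input, not a value from the paper.
pairs = make_database(n_pairs=20000, base_error=0.25, n_terms=1000, rng=rng)
estimate = estimate_base_error(pairs, [0.0, 0.1, 0.2, 0.3, 0.4, 0.5], 1000, rng)
print(f"estimated base error rate: {estimate:.3f}")
```

In this toy setting the estimator never sees which labels are correct; it recovers the base error rate only from how the agreement measure degrades as known amounts of error are added. Real BLAST-matched sequences do not always share a function, so an actual analysis would need a more careful model of the error-free precision.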

