Abstract

Background: Unsupervised annotation of proteins by software pipelines suffers from very high error rates. Spurious functional assignments are usually caused by unwarranted homology-based transfer of information from existing database entries to the new target sequences. We have previously demonstrated that data mining in large sequence annotation databanks can help identify annotation items that are strongly associated with each other, and that exceptions from strong positive association rules often point to potential annotation errors. Here we investigate the applicability of negative association rule mining to revealing erroneously assigned annotation items.

Results: Almost all exceptions from strong negative association rules are connected to at least one wrong attribute in the feature combination making up the rule. The fraction of annotation features flagged by this approach as suspicious is strongly enriched in errors and constitutes about 0.6% of the whole body of the similarity-transferred annotation in the PEDANT genome database. Positive rule mining does not identify two thirds of these errors. The approach based on exceptions from negative rules is much more specific than positive rule mining, but its coverage is significantly lower.

Conclusion: Mining of both negative and positive association rules is a potent tool for finding significant trends in protein annotation and flagging doubtful features for further inspection.

Highlights

  • There were 7318 such rules with 26969 exceptions in total. One example of such a rule is "Nuclear protein, not Bacteria, 0.033, 1808, 0.647, 35620, 0.033, 1798, 0.994, 1.537, 0.011, 628.413". This statement, although obvious from a biological point of view, does not hold as an absolute rule: of the 1808 protein entries annotated with the keyword "Nuclear protein" in the PEDANT database, only 1798 have eukaryotic origin
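
The arithmetic behind the example rule can be checked in a few lines. The sketch below is a minimal illustration in Python, not the authors' pipeline: it takes the two counts quoted above (1808 entries carrying the keyword, 1798 of them satisfying the rule) and derives the number of exceptions and the rule confidence. Reading the value 0.647 in the rule string as the support of the "not Bacteria" consequent is an assumption, made here only so that lift can be computed with its conventional definition (confidence divided by consequent support); the function and variable names are hypothetical.

```python
def rule_stats(n_antecedent, n_both, consequent_support=None):
    """Summarise a rule "antecedent -> consequent" from raw counts.

    n_antecedent:       entries matching the antecedent (left-hand side)
    n_both:             entries matching antecedent AND consequent
    consequent_support: fraction of all entries matching the consequent
                        (optional; only needed for lift)
    """
    if not 0 <= n_both <= n_antecedent or n_antecedent == 0:
        raise ValueError("counts must satisfy 0 <= n_both <= n_antecedent, n_antecedent > 0")

    stats = {
        # Entries that match the antecedent but violate the rule
        "exceptions": n_antecedent - n_both,
        # Fraction of antecedent entries for which the rule holds
        "confidence": n_both / n_antecedent,
    }
    if consequent_support is not None:
        # Conventional lift: confidence relative to the consequent's base rate
        stats["lift"] = stats["confidence"] / consequent_support
    return stats


# Counts taken from the "Nuclear protein, not Bacteria" example;
# 0.647 is ASSUMED to be the support of the consequent "not Bacteria".
stats = rule_stats(n_antecedent=1808, n_both=1798, consequent_support=0.647)
print(stats["exceptions"])
print(round(stats["confidence"], 3))
print(round(stats["lift"], 3))
```

Under these assumptions the rule has 10 exceptions, and the rounded confidence and lift agree with the values 0.994 and 1.537 that appear in the rule string, which suggests (but does not prove) that those two fields carry these measures. Those 10 exceptions are the entries that would be flagged for inspection.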

Summary

Introduction

In silico annotation generated by bioinformatics methods has the advantage of being efficient and cheap, but at the same time suffers from a notoriously high error level [4,5]. Most of these errors are caused by homology-based annotation transfer where available similarity is not sufficient to warrant the transfer of information from the source to the target sequence, or because the annotation of the
