Abstract

Background: The classification of protein domains in the CATH resource is primarily based on structural comparisons, sequence similarity and manual analysis. One of the main bottlenecks in the processing of new entries is the evaluation of 'borderline' cases by human curators with reference to the literature, and better tools for helping both expert and non-expert users quickly identify relevant functional information from text are urgently needed. A text based method for protein classification is presented, which complements the existing sequence and structure-based approaches, especially in cases exhibiting low similarity to existing members and requiring manual intervention. The method is based on the assumption that textual similarity between sets of documents relating to proteins reflects biological function similarities and can be exploited to make classification decisions.

Results: An optimal strategy for the text comparisons was identified by using an established gold standard enzyme dataset. Filtering of the abstracts using a machine learning approach to discriminate sentences containing functional, structural and classification information that are relevant to the protein classification task improved performance. Testing this classification scheme on a dataset of 'borderline' protein domains that lack significant sequence or structure similarity to classified proteins showed that although, as expected, the structural similarity classifiers perform better on average, there is a significant benefit in incorporating text similarity in logistic regression models, indicating significant orthogonality in this additional information. Coverage was significantly increased especially at low error rates, which is important for routine classification tasks: 15.3% for the combined structure and text classifier compared to 10% for the structural classifier alone, at 10^-3 error rate. Finally, when only the highest scoring predictions were used to infer classification, an extra 4.2% of correct decisions were made by the combined classifier.

Conclusion: We have described a simple text based method to classify protein domains that demonstrates an improvement over existing methods. The method is unique in incorporating structural and text based classifiers directly and is particularly useful in cases where inconclusive evidence from sequence or structure similarity requires laborious manual classification.
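The coverage figures quoted at a fixed error rate can be made concrete with a small sketch. The exact definitions used in the study are not given here, so the following assumes that coverage is the fraction of correct predictions accepted and that the error rate is the fraction of incorrect predictions accepted, both measured while walking down a list ranked by score. The function name and the toy scores are hypothetical, and tied scores are not treated specially.

```python
def coverage_at_error_rate(scored, max_error_rate):
    """Return coverage at the lowest score cut-off that keeps the
    error rate at or below max_error_rate.

    scored: list of (score, is_correct) pairs, one per prediction.
    Assumed definitions (not taken from the source):
      coverage   = correct predictions accepted / all correct predictions
      error rate = incorrect predictions accepted / all incorrect predictions
    """
    total_correct = sum(1 for _, ok in scored if ok)
    total_wrong = len(scored) - total_correct
    if total_correct == 0:
        return 0.0

    # Accept predictions from the highest score downwards.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    correct = wrong = 0
    best_coverage = 0.0
    for _, ok in ranked:
        if ok:
            correct += 1
        else:
            wrong += 1
        error_rate = wrong / total_wrong if total_wrong else 0.0
        if error_rate > max_error_rate:
            break  # accepting this prediction exceeds the error budget
        best_coverage = correct / total_correct
    return best_coverage


# Toy, made-up scores purely for illustration.
toy = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.5, False)]
strict = coverage_at_error_rate(toy, 0.0)
relaxed = coverage_at_error_rate(toy, 0.5)
```

Under these assumptions, comparing two classifiers at the same error budget amounts to calling the function once per classifier on the same set of labelled predictions.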

Highlights

  • The classification of protein domains in the CATH resource is primarily based on structural comparisons, sequence similarity and manual analysis

  • We have developed a novel text based approach for protein classification, which is based on the text similarity of documents related to proteins, with a view to supporting the curation of protein structure databases

  • Combined structure and text classifier outperforms structural similarity in protein classification of 'borderline' cases in CATH

  • An all-versus-all text comparison was performed using DC1.1993 as the query set and textCATH as the reference set according to the optimal conditions identified in the gold standard enzyme dataset
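
An all-versus-all text comparison of a query set against a reference set can be sketched roughly as below. The similarity measure and the optimal conditions used in the study are not specified in this excerpt, so the sketch assumes cosine similarity over raw term counts with a simple lowercase tokenizer. All function names, identifiers and example texts are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(documents):
    """Pool a set of documents into one bag-of-words count vector."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z0-9]+", doc.lower()))
    return counts


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def all_versus_all(queries, references):
    """Score every query document set against every reference set.

    queries, references: dict mapping an identifier to a list of texts.
    Returns a dict mapping each query identifier to a list of
    (similarity, reference identifier) pairs, best match first.
    """
    ref_vectors = {rid: term_vector(docs) for rid, docs in references.items()}
    results = {}
    for qid, docs in queries.items():
        query_vector = term_vector(docs)
        scored = [(cosine(query_vector, rv), rid) for rid, rv in ref_vectors.items()]
        results[qid] = sorted(scored, reverse=True)
    return results


# Made-up example texts; real input would be abstracts linked to each protein.
queries = {"query_domain": ["kinase phosphorylates serine residues"]}
references = {
    "family_a": ["serine kinase activity"],
    "family_b": ["dna binding helix"],
}
ranking = all_versus_all(queries, references)
```

Pooling all documents for a protein into a single vector is one design choice among several; scoring document pairs individually and aggregating the scores would be an equally plausible reading of "sets of documents".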


Summary

Introduction

The classification of protein domains in the CATH resource is primarily based on structural comparisons, sequence similarity and manual analysis. Although it may be possible to classify many protein domains purely on the basis of clear sequence and structural similarity, there are many cases that exhibit 'borderline' or low similarity to existing members and that require laborious manual classification. This manual classification usually requires study of the relevant literature, and so classification of these 'borderline' domains may benefit from automated literature analysis. To address this need, text mining based methods may complement the existing molecular computational approaches, especially in cases where the evidence from such sequence and structural similarities is inconclusive.

