Using Text Classification for Gene Function Annotation

Soumya Raychaudhuri

doi:10.1093/oso/9780198567400.003.0015

Abstract

Recognizing specific biological concepts described in text is an important task that is receiving increasing attention in bioinformatics. To leverage the literature effectively, sophisticated data analysis algorithms must be able to identify key biological concepts and functions in text. However, biomedical text is complex and diverse in subject matter and lexicon, and highly specialized vocabularies have been developed to describe biological complexity. In addition, understanding text in general by computational means has historically been a challenging problem (Rosenfeld 2000).

In this chapter we focus on the basics of understanding the content of biological text. We describe common text classification algorithms and demonstrate how they can be applied to the specific biological problem of gene annotation. Text classification is also potentially instrumental to many other areas of bioinformatics; we will see other applications in Chapter 10.

There is great interest in assigning functional annotations to genes from the scientific literature. In one recent symposium, 33 groups proposed and implemented classification algorithms to identify articles that were specifically relevant for gene function annotation (Hersh, Bhuporaju et al. 2004). In another recent symposium, seven groups competed to assign Gene Ontology function codes to genes from primary text (Valencia, Blaschke et al. 2004). In this chapter we assign biological function codes to genes automatically, in order to investigate the extent to which computational approaches can identify relevant biological concepts directly in text about genes. Each code represents a specific biological function such as "signal transduction" or "cell cycle". The key concepts in this chapter are presented in the frame box.

We introduce three text classification methods that can be used to associate functional codes with a set of literature abstracts: maximum entropy modeling, naive Bayes classification, and nearest neighbor classification. We describe and test all three. Maximum entropy modeling outperforms the other methods and assigns appropriate functions to articles with an accuracy of 72%. The maximum entropy method also provides confidence measures that correlate well with performance.
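To make the idea of assigning a function code to an abstract concrete, here is a minimal sketch of naive Bayes classification, one of the three methods named above. It is an illustration only, not the chapter's implementation: the function names (train_naive_bayes, classify), the whitespace tokenizer, the add-one smoothing, and the four toy "abstracts" are all assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """Count documents per label and words per label.

    docs is a list of (text, label) pairs. Tokenization is plain
    lowercased whitespace splitting (an assumption for this sketch).
    """
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def classify(model, text):
    """Return (best_label, log_scores) for one piece of text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        # Log prior: fraction of training documents carrying this label.
        score = math.log(n_docs / total_docs)
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training are ignored
            # Add-one (Laplace) smoothed log likelihood of the word.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get), scores


# Hypothetical toy data; real abstracts would be far longer and noisier.
TOY_ABSTRACTS = [
    ("kinase receptor phosphorylation cascade signaling", "signal transduction"),
    ("receptor ligand binding activates kinase pathway", "signal transduction"),
    ("cyclin mitosis checkpoint division", "cell cycle"),
    ("cyclin dependent kinase regulates mitosis checkpoint", "cell cycle"),
]

model = train_naive_bayes(TOY_ABSTRACTS)
predicted, scores = classify(model, "cyclin levels control the mitosis checkpoint")
print(predicted)
```

The classifier picks the label with the highest log score. The gap between the scores gives a rough sense of how decisive the evidence is, although it is not the confidence measure the chapter reports for maximum entropy modeling.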
