Abstract

Recognizing specific biological concepts described in text is an important task that is receiving increasing attention in bioinformatics. To leverage the literature effectively, data analysis algorithms must be able to identify key biological concepts and functions in text. However, biomedical text is complex and diverse in both subject matter and lexicon, and highly specialized vocabularies have been developed to describe biological complexity. In addition, understanding text computationally has historically been a challenging problem (Rosenfeld 2000).

In this chapter we focus on the basics of understanding the content of biological text. We describe common text classification algorithms and demonstrate how they can be applied to the specific biological problem of gene annotation. Text classification is also potentially instrumental to many other areas of bioinformatics; we will see other applications in Chapter 10.

There is great interest in assigning functional annotations to genes from the scientific literature. In one recent symposium, 33 groups proposed and implemented classification algorithms to identify articles that were specifically relevant for gene function annotation (Hersh, Bhupatiraju et al. 2004). In another recent symposium, seven groups competed to assign Gene Ontology function codes to genes from primary text (Valencia, Blaschke et al. 2004). In this chapter we assign biological function codes to genes automatically, in order to investigate the extent to which computational approaches can identify relevant biological concepts directly in text about genes. Each code represents a specific biological function such as “signal transduction” or “cell cycle”.

The key concepts in this chapter are presented in the frame box. We introduce three text classification methods that can be used to associate functional codes with a set of literature abstracts.
We describe and test maximum entropy modeling, naive Bayes classification, and nearest neighbor classification. Maximum entropy modeling outperforms the other methods, and assigns appropriate functions to articles with an accuracy of 72%. The maximum entropy method provides confidence measures that correlate well with performance.
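
To make the idea of assigning a function code to an abstract concrete, the following is a minimal sketch of one of the three methods named above, naive Bayes classification. It is an illustration under stated assumptions and not the implementation evaluated in the chapter: the toy abstracts, the whitespace tokenizer, the add-one smoothing, and the function names `train_naive_bayes` and `classify` are all hypothetical choices made for this example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy "abstracts", each labeled with one function code.
TOY_ABSTRACTS = [
    ("kinase receptor phosphorylation signaling cascade", "signal transduction"),
    ("receptor ligand binding activates kinase signaling", "signal transduction"),
    ("cyclin checkpoint mitosis division", "cell cycle"),
    ("mitosis spindle checkpoint cyclin arrest", "cell cycle"),
]


def train_naive_bayes(docs):
    """Collect the counts a multinomial naive Bayes model needs.

    docs is a list of (text, code) pairs. Tokenization is a plain
    lowercase whitespace split, which is an assumption of this sketch.
    """
    class_counts = Counter()            # number of documents per code
    word_counts = defaultdict(Counter)  # word frequencies per code
    vocab = set()
    for text, code in docs:
        words = text.lower().split()
        class_counts[code] += 1
        word_counts[code].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def classify(text, model):
    """Return (best_code, log_scores) for a new piece of text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for code in class_counts:
        # Log prior: fraction of training documents carrying this code.
        score = math.log(class_counts[code] / total_docs)
        total_words = sum(word_counts[code].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training are ignored
            # Add-one (Laplace) smoothed log likelihood of the word.
            count = word_counts[code][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[code] = score
    return max(scores, key=scores.get), scores


model = train_naive_bayes(TOY_ABSTRACTS)
best_code, log_scores = classify("cyclin regulates mitosis", model)
print(best_code)
```

In this sketch the unnormalized log scores only rank the codes against each other; they are not calibrated confidence measures of the kind reported for the maximum entropy method.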
