Tutorial on text mining of biomedical literature repositories

Ashish V Tendulkar

doi:10.5555/2591338.2591346

Abstract

There is increasing interest in developing biomedical text mining applications, not only to improve literature search but also to automatically detect pointers between biologically relevant entities described in articles and their corresponding records in existing annotation databases. The rapid growth of natural language data in the biomedical sciences (including scientific articles, patents, patient records and textual database descriptions), together with the practical relevance of these resources for the design, interpretation and evaluation of bioinformatics and experimental research, has resulted in the implementation of a considerable number of new applications. Text mining assisted literature curation has been especially promising for the development and maintenance of manually annotated databases, as well as for the construction of gold standard datasets and gene lists in the context of Systems Biology and gene set enrichment. Attempts have also been made to integrate text mining with other bioinformatics data such as sequence, structural and gene expression information.

We plan to focus primarily on applications of text mining and on issues in building text mining systems. We will begin with a gentle introduction to text mining and its applications in various biology and bioinformatics related domains. Existing resources for building text mining applications will be presented in terms of (1) useful data collections, (2) lexical resources, (3) features of natural language data that can be exploited by text mining systems and (4) data mining and natural language processing systems. The main types of currently available text mining applications will also be discussed, including the retrieval and classification of articles, the identification of mentions of biological entities such as genes, proteins and cell types, and the extraction of functional descriptions or protein interactions.
The use of literature for knowledge discovery and hypothesis generation will be described. A crucial aspect of literature mining systems is evaluation and usability; these two aspects will be covered through recent community evaluation efforts such as the BioCreative challenge and the BioCreative metaserver initiative. To show what kinds of queries and results are currently supported by text mining and information extraction systems, practical example cases will be illustrated in detail, complementing the previously introduced basic descriptions of the underlying methodology. Finally, a practical case study will show the step-by-step implementation of a text mining system, illustrating how such a system can be constructed for a particular information need.

After the tutorial, participants should be aware of the importance of the biomedical literature as a central data and information source for biology and bioinformatics. They should be able to understand how existing text mining systems work and what features they rely on. Participants will have an overview of currently available tools and of how to construct such an application in practice.
