Abstract

Background

Text mining has become a useful tool for biologists trying to understand the genetics of diseases. In particular, it can help identify the most interesting candidate genes for a disease for further experimental analysis. Many text mining approaches have been introduced, but the effectiveness of disease-gene identification varies across text mining models. Thus, incorporating more text mining models may help to obtain more refined and accurate knowledge. However, how to combine these models effectively remains a challenging question in machine learning. In particular, it is a non-trivial issue to guarantee that the integrated model performs better than the best individual model.

Results

We present a multi-view approach to retrieve biomedical knowledge using different controlled vocabularies. These controlled vocabularies are selected on the basis of nine well-known bio-ontologies and are applied to index the vast amounts of gene-based free-text information available in the MEDLINE repository. The text mining result specified by a vocabulary is considered as a view, and the multiple views obtained are integrated by multi-source learning algorithms. We investigate the effect of integration in two fundamental computational disease gene identification tasks: gene prioritization and gene clustering. The performance of the proposed approach is systematically evaluated and compared on real benchmark data sets. In both tasks, the multi-view approach demonstrates significantly better performance than the other methods compared.

Conclusions

In practical research, the relevance of a specific vocabulary to the task is usually unknown. In such cases, multi-view text mining is a superior and promising strategy for text-based disease gene identification.

Highlights

  • Text mining has become a useful tool for biologists trying to understand the genetics of diseases

  • The most important resource for biomedical text mining applications is the MEDLINE database developed by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM)

  • The Gene Ontology (GO) [14] provides consistent descriptions of gene and gene-product attributes in the form of three structured controlled vocabularies that each provide a specific angle of view

Summary

Introduction

Text mining has become a useful tool for biologists trying to understand the genetics of diseases. It can help identify the most interesting candidate genes for a disease for further experimental analysis. There has been a surge of interest in the automatic exploration of the biomedical literature, ranging from modest approaches such as annotating and extracting keywords from biomedical text to more ambitious attempts such as Natural Language Processing (NLP) and text-based network construction and inference. These computational efforts help biologists identify the most likely disease candidates for further experimental validation. To extract relevant information from the immense amount of data and to retrieve useful high-level knowledge from that information, text mining and machine learning have become indispensable tools in practical research.

