Content-rich biological network constructed by mining PubMed abstracts

Hao Chen, Burt M Sharp

doi:10.1186/1471-2105-5-147

Open Access

https://doi.org/10.1186/1471-2105-5-147

Abstract

Background: The integration of the rapidly expanding corpus of information about the genome, transcriptome, and proteome, engendered by powerful technological advances, such as microarrays, and the availability of genomic sequence from multiple species, challenges the grasp and comprehension of the scientific community. Despite the existence of text-mining methods that identify biological relationships based on the textual co-occurrence of gene/protein terms or similarities in abstract texts, knowledge of the underlying molecular connections on a large scale, which is prerequisite to understanding novel biological processes, lags far behind the accumulation of data. While computationally efficient, the co-occurrence-based approaches fail to characterize the nature of biological interactions (e.g., inhibition or stimulation, directionality). Programs with natural language processing (NLP) capability have been created to address these limitations; however, they are in general not readily accessible to the public.

Results: We present an NLP-based text-mining approach, Chilibot, which constructs content-rich relationship networks among biological concepts, genes, proteins, or drugs. Among its features is the ability to generate suggestions for new hypotheses. Lastly, we provide evidence that the connectivity of molecular networks extracted from the biological literature follows the power-law distribution, indicating scale-free topologies consistent with the results of previous experimental analyses.

Conclusions: Chilibot distills scientific relationships from knowledge available throughout a wide range of biological domains and presents these in a content-rich graphical format, thus integrating general biomedical knowledge with the specialized knowledge and interests of the user. Chilibot is accessible free of charge to academic users.

Highlights

The integration of the rapidly expanding corpus of information about the genome, transcriptome, and proteome, engendered by powerful technological advances, such as microarrays, and the availability of genomic sequence from multiple species, challenges the grasp and comprehension of the scientific community
We present a text mining approach, Chilibot, which constructs content-rich relationship networks between genes, proteins, drugs and biological concepts based on linguistic analysis of relevant records stored in the PubMed literature database
Design and implementation: The overall goal of Chilibot is to generate graphical representations of the relationships among user-provided terms. This is achieved by automatically querying the PubMed literature database and extracting information using natural language processing (NLP) techniques
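
The query-and-extract idea described above can be illustrated with a minimal Python sketch. It is not Chilibot's implementation: the function names, the boolean query format, and the cue-word lists are all assumptions made for illustration, and a keyword lookup stands in for real linguistic analysis. It makes no network calls and does not attempt to infer directionality.

```python
from itertools import combinations

# Hypothetical cue words (an assumption); the actual linguistic rules
# used by Chilibot are not described in this section.
STIMULATORY = {"activates", "stimulates", "induces", "enhances"}
INHIBITORY = {"inhibits", "suppresses", "blocks", "represses"}


def pairwise_queries(terms):
    """Build one boolean query string per unordered pair of user terms."""
    return {(a, b): f"{a} AND {b}" for a, b in combinations(terms, 2)}


def classify_sentence(sentence, a, b):
    """Give a coarse relationship label to a sentence mentioning both terms."""
    words = sentence.lower().replace(",", " ").replace(".", " ").split()
    if a.lower() not in words or b.lower() not in words:
        return None  # both terms must co-occur in the sentence
    if STIMULATORY & set(words):
        return "stimulatory"
    if INHIBITORY & set(words):
        return "inhibitory"
    return "neutral"
```

Each labelled pair would become one edge in the relationship graph, with the label carried as an edge attribute.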

Summary

Results

We present an NLP-based text-mining approach, Chilibot, which constructs content-rich relationship networks among biological concepts, genes, proteins, or drugs. We provide evidence that the connectivity of molecular networks extracted from the biological literature follows the power-law distribution, indicating scale-free topologies consistent with the results of previous experimental analyses.
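
As a rough illustration of what a power-law connectivity claim means, the sketch below counts node degrees in an undirected edge list and fits a straight line to the log-log degree distribution. This is an assumed, simplified check rather than the analysis performed in the paper: the function names are hypothetical, and a least-squares fit on log-log data is only a crude estimate of the exponent.

```python
import math
from collections import Counter


def degree_counts(edges):
    """Count how many nodes have each degree in an undirected edge list."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return Counter(degree.values())


def loglog_slope(counts):
    """Least-squares slope of log(node count) against log(degree)."""
    xs = [math.log(k) for k in counts]
    ys = [math.log(n) for n in counts.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

For a network whose degree distribution follows P(k) ∝ k^(-γ), the fitted slope approximates -γ; an approximately linear log-log plot is the usual informal signature of a scale-free topology.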

Conclusions

Background

Results and discussion

Figure panel B: Number of abstracts selected for retrieval

Conclusion

Methods