Abstract

Recent scientific advances have accumulated a tremendous amount of biomedical knowledge, providing novel insights into the relationship between molecular and cellular processes and diseases. Literature mining is one of the commonly used methods for retrieving and extracting information from scientific publications to understand these associations. However, because of the large data volume and the complicated, noisy associations, interpreting such association data for semantic knowledge discovery is challenging. In this study, we describe an integrative computational framework that aims to expedite the discovery of latent disease mechanisms by dissecting 146,245 disease-gene associations from over 25 million PubMed-indexed articles. We take advantage of both Latent Dirichlet Allocation (LDA) modeling and network-based analysis for their capabilities of detecting latent associations and reducing noise in large-volume data, respectively. Our results demonstrate that (1) the LDA-based modeling is able to group similar diseases into disease topics; (2) the disease-specific association networks follow the scale-free network property; (3) certain subnetwork patterns are enriched in the disease-specific association networks; and (4) genes are enriched in topic-specific biological processes. Our approach offers promising opportunities for latent disease-gene knowledge discovery in biomedical research.
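To make the LDA step concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each disease is treated as a "document" whose "words" are the genes associated with it, and it fits the model with a small collapsed Gibbs sampler. The function name `fit_lda`, the hyperparameter values, and the toy identifiers (`D1`, `G1`, and so on) are all hypothetical and chosen only for illustration.

```python
import random


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs maps a disease name to a list of gene symbols, one entry per
    association. Returns (theta, phi): per-disease topic proportions and
    per-topic gene distributions, both normalized to sum to 1.
    """
    rng = random.Random(seed)
    vocab = sorted({g for genes in docs.values() for g in genes})
    n_vocab = len(vocab)

    # Count tables used by the sampler.
    doc_topic = {d: [0] * n_topics for d in docs}
    topic_gene = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = {d: [] for d in docs}

    # Random initial topic assignment for every disease-gene association.
    for d, genes in docs.items():
        for g in genes:
            k = rng.randrange(n_topics)
            assign[d].append(k)
            doc_topic[d][k] += 1
            topic_gene[k][g] += 1
            topic_total[k] += 1

    for _ in range(n_iter):
        for d, genes in docs.items():
            for i, g in enumerate(genes):
                # Remove the current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_gene[k][g] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this association.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_gene[t][g] + beta)
                    / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_gene[k][g] += 1
                topic_total[k] += 1

    theta = {
        d: [(c + alpha) / (len(docs[d]) + n_topics * alpha) for c in counts]
        for d, counts in doc_topic.items()
    }
    phi = [
        {g: (topic_gene[k][g] + beta) / (topic_total[k] + n_vocab * beta)
         for g in vocab}
        for k in range(n_topics)
    ]
    return theta, phi


# Toy input with placeholder identifiers (not real associations).
toy_docs = {
    "D1": ["G1", "G2", "G3", "G1"],
    "D2": ["G1", "G2", "G2"],
    "D3": ["G4", "G5", "G6"],
    "D4": ["G5", "G6", "G6", "G4"],
}
theta, phi = fit_lda(toy_docs, n_topics=2)
```

Here `theta[d]` gives the topic proportions for disease `d`, and `phi[k]` gives the gene distribution of topic `k`; diseases can then be grouped by their dominant topic. For data at the scale described above, an established library implementation would be a more practical choice than this sampler.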

Highlights

  • In recent decades, a vast amount of biomedical research has been conducted to investigate disease classifications, health records, clinical trials, and adverse event reports that can be utilized to establish links between disease and genes, in order to identify novel treatments for diseases [1]

  • Our results demonstrate that (1) the Latent Dirichlet Allocation (LDA)-based approach is able to group related diseases into the same disease topics based on their high-dimensional yet sparse associations with genes; (2) the disease-specific association network follows the scale-free network property, in which hub nodes are rich in diseases and genes closely related to each other; (3) significant network motif patterns can be detected in the disease-specific networks, indicating novel yet latent disease mechanisms; and (4) genes in the association network are significantly enriched in biological processes and canonical pathways highly involved in hub diseases

  • To address the issues of semantic granularity and inherent noise in high-dimensional disease-gene association data mined from the literature, we proposed an integrative analytical framework that combines LDA and network analysis to facilitate latent disease-gene association discovery and provide insights into the relationship between molecular and cellular processes and diseases
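The scale-free property and the notion of hub nodes mentioned in the highlights can be illustrated with a short sketch. It assumes the association network is stored as a list of (disease, gene) pairs and uses a crude least-squares fit on the log-log degree distribution; a negative, roughly linear trend is consistent with a power law but does not prove one. The function names and the toy edge list are hypothetical, and the source does not state how the authors tested the property.

```python
import math
from collections import Counter


def node_degrees(edges):
    """Degree of every node in a bipartite disease-gene network."""
    deg = Counter()
    for disease, gene in set(edges):  # ignore duplicate associations
        deg[("disease", disease)] += 1
        deg[("gene", gene)] += 1
    return deg


def degree_distribution(deg):
    """Map each degree k to the fraction of nodes having that degree."""
    counts = Counter(deg.values())
    n_nodes = len(deg)
    return {k: c / n_nodes for k, c in sorted(counts.items())}


def loglog_slope(dist):
    """Least-squares slope of log P(k) against log k."""
    pts = [(math.log(k), math.log(p)) for k, p in dist.items()]
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    if sxx == 0:
        raise ValueError("need at least two distinct degrees to fit a slope")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx


# Toy edge list with placeholder identifiers (not real associations).
toy_edges = [
    ("D1", "G1"), ("D1", "G2"), ("D1", "G3"), ("D1", "G4"),
    ("D2", "G1"), ("D3", "G1"), ("D4", "G2"),
]
deg = node_degrees(toy_edges)
dist = degree_distribution(deg)
slope = loglog_slope(dist)
hub, hub_degree = deg.most_common(1)[0]
```

In this toy network most nodes have degree 1 while `D1` acts as the hub, which is the qualitative pattern a scale-free network shows. A rigorous analysis would use a proper power-law fitting procedure with a goodness-of-fit test rather than a straight-line fit.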


Summary

Results

From SemMedDB Version 25, we extracted 146,245 disease-gene associations between 7,039 diseases and 10,921 genes from the titles and abstracts of over 25 million PubMed articles. Consistent with the results of the overall cosine similarity measurement, the similarities among the top 10 topics were higher at the disease level (average value 0.26) than at the gene level (average value 0.146). These observations suggested that, although there are overlapping genes and diseases among topics, our LDA process was able to generate distinct disease groups based on the disease-gene associations embedded in SemMedDB. Most of the top-ranked canonical pathways and networks enriched in these genes have been proven to be associated with lung cancer (S6 File). These results suggested that (1) the lung cancer topic shares similar network properties with other disease-gene association networks, where important diseases were prioritized through network analysis, and (2) genes allocated to the topic were enriched in biological processes that can serve as potential research focuses in lung research. Other significant pathways and genes that are not well known to be associated with lupus can serve as future directions.
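The topic-level cosine similarity reported above can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: each topic is represented as a sparse vector of weights over genes (or, equally, over diseases), and the reported average is assumed to be the mean over all topic pairs. The function names and toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(topics):
    """Average cosine similarity over all unordered pairs of topics."""
    pairs = list(combinations(topics, 2))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy topic vectors with placeholder gene identifiers.
toy_topics = [
    {"G1": 0.5, "G2": 0.5},
    {"G2": 0.5, "G3": 0.5},
    {"G4": 1.0},
]
avg_similarity = mean_pairwise_similarity(toy_topics)
```

Running the same function once on topic-by-disease vectors and once on topic-by-gene vectors would give the two averages being compared; a lower value means the topics share fewer members.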

Discussions and conclusions
Materials and methods
Evaluation of gene similarity
