Abstract

Literature-Based Discovery (LBD) aims to connect scientists across silos by assembling models of the literature to reveal previously hidden connections. Unfortunately, LBD systems have been unable to achieve user adoption on a large scale. This work develops open-source software in Python to convert a database of semantic predications from all of PubMed's 27.9 million indexed abstracts into a semantic inference network and biomedical concept graph in Neo4j. The developed software, called SemNet, queries a modified version of the publicly available SemMedDB and computes feature vectors on source-target pairs. Each unique Unified Medical Language System (UMLS) concept is represented as a node and each predication as an edge. Each node is assigned one of 132 node labels (e.g., Amino Acid, Peptide, or Protein (AAPP); Gene or Genome (GG); etc.) and each edge is labeled with one of 58 predications (e.g., treats, causes, inhibits, etc.). SemNet computes a single feature value for each metapath, or sequence of node types, between a source node and user-specified target node(s). Several different types of metapath-based features (count, degree weighted path count, and HeteSim metric) are computed and vectorized. SemNet employs an unsupervised learning algorithm for rank aggregation (ULARA) to rank the identified source nodes that are most relevant to the user-specified target node(s). Statistical analysis of correlation among identified source nodes or resultant literature network features is used to identify patterns that can guide future research. Analysis of high residual nodes is used to compare and contrast SemNet rankings between different targets of interest. An example SemNet use case is presented to assess “the differential impact of smoking on cognition in males and females” using the following target nodes: nicotine, learning, memory, tetrahydrocannabinol (THC), cigarette smoke, X chromosome, and Y chromosome. Detailed rankings are discussed.
Overall, the results suggest a hypothesis in which smoking negatively impacts cognition to a greater extent in females but has stronger cardiovascular impacts in males. In summary, SemNet provides an adoptable method for efficient LBD of PubMed that extends beyond omics-only relationships to true multi-scalar connections that can provide actionable insight for predictive medicine, research prioritization, and clinical care.
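
To make the metapath features concrete, the following is a minimal Python sketch of two of the feature types named above, the metapath count and the degree weighted path count, computed on a tiny in-memory graph. It is an illustration under stated assumptions and not SemNet's implementation: the nodes, type labels, and edges are invented toy data rather than SemMedDB content, edges are treated as undirected with predication labels ignored, node degree is the total degree rather than a per-edge-type degree, and the damping exponent of 0.4 is an assumed value.

```python
from collections import defaultdict

# Hypothetical toy data (NOT taken from SemMedDB): node -> illustrative type label.
NODE_TYPE = {
    "nicotine": "PHSU",
    "CHRNA4": "GG",
    "BDNF": "GG",
    "dopamine": "NSBA",
    "memory": "MENP",
}

# Invented, undirected toy edges; predication labels and direction are ignored.
EDGES = [
    ("nicotine", "CHRNA4"),
    ("nicotine", "BDNF"),
    ("nicotine", "dopamine"),
    ("CHRNA4", "memory"),
    ("BDNF", "memory"),
    ("dopamine", "memory"),
    ("CHRNA4", "dopamine"),
]


def build_adjacency(edges):
    """Return an undirected adjacency map: node -> set of neighbours."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def simple_paths(adj, source, target, max_len):
    """Yield simple paths (no repeated node) with at most max_len edges."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target and len(path) > 1:
            yield path
            continue
        if len(path) - 1 == max_len:
            continue
        for nxt in sorted(adj[node]):
            if nxt not in path:
                stack.append((nxt, path + [nxt]))


def metapath_features(adj, node_type, source, target, max_len=3, damping=0.4):
    """Group source-target paths by metapath (sequence of node types).

    Returns two dicts keyed by metapath: the plain path count, and a
    degree weighted path count in which each edge on a path contributes
    (deg(u) * deg(v)) ** -damping, so paths through hub nodes count less.
    """
    counts = defaultdict(int)
    dwpc = defaultdict(float)
    for path in simple_paths(adj, source, target, max_len):
        metapath = tuple(node_type[n] for n in path)
        counts[metapath] += 1
        weight = 1.0
        for u, v in zip(path, path[1:]):
            weight *= (len(adj[u]) * len(adj[v])) ** -damping
        dwpc[metapath] += weight
    return dict(counts), dict(dwpc)


adj = build_adjacency(EDGES)
counts, dwpc = metapath_features(adj, NODE_TYPE, "nicotine", "memory")
for metapath in sorted(counts):
    print(" -> ".join(metapath), counts[metapath], round(dwpc[metapath], 4))
```

Each metapath yields one value per feature type, so a source-target pair can be described by a vector with one entry per metapath. The down-weighting in the second feature reflects the intuition that a path through a highly connected concept is weaker evidence of a specific relationship than a path through a sparsely connected one.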

Highlights

  • Biomedical literature represents an ever-growing repository of complex and interrelated knowledge

  • We utilize the publicly available SemMedDB and the Unified Medical Language System (UMLS) Metathesaurus, which is explained in more detail below, to convert the raw abstract text into a set of shared categories and relationships

  • This section includes: a basic walk-through of generalized SemNet results and performance with discussion on how to visualize and optimize SemNet analyses; a detailed example of insight gained for a specific use case for a research question examining “how cigarette smoke or THC differentially impacts learning or memory in males and females”; a discussion of other general uses for SemNet; and limitations and future directions for SemNet

Introduction

Biomedical literature represents an ever-growing repository of complex and interrelated knowledge. Even with the great power of user-specified PubMed searches, it is difficult for a scientist or clinician to keep up with the literature in their specialty niche, much less understand the thousands of articles interconnected with their general domain. The National Library of Medicine has argued that better knowledge management tools have the potential to impact the efficacy of biomedical research at the level of researchers, policymakers, and scientific publishers (Kilicoglu, 2017). The open-source technology developed here, SemNet, makes PubMed relationship literature mining adoptable by a much greater audience of scientists and clinicians who want to leverage the power of literature mining to guide their research and development efforts.
