Abstract

The embedding of Medical Subject Headings (MeSH) terms has become a foundation for many downstream bioinformatics tasks. Recent studies employ different data sources to learn these embeddings: the corpus (in which each document is indexed by a set of MeSH terms), the MeSH term ontology, and the semantic predications between MeSH terms (extracted by SemMedDB). While each of these data sources contributes to learning the MeSH term embeddings, current approaches fail to incorporate all of them in the learning process. The challenge is that the structured relationships between MeSH terms differ across the data sources, and no existing approach fuses such complex data into MeSH term embedding learning. In this paper, we study the problem of incorporating corpus, ontology, and semantic predications to learn the embeddings of MeSH terms. We propose a novel framework, Corpus, Ontology, and Semantic predications-based MeSH term embedding (COS), to generate high-quality MeSH term embeddings. COS converts the corpus, ontology, and semantic predications into MeSH term sequences, merges these sequences, and learns MeSH term embeddings from the merged sequences. Extensive experiments on different datasets show that COS outperforms various baseline embeddings and traditional non-embedding-based baselines.
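The convert-and-merge step described above can be illustrated with a minimal Python sketch. The abstract does not specify how each source is turned into sequences, so the conversion rules here are assumptions made only for illustration: one sequence per document's term set, one per root-to-leaf ontology path, and one subject-object pair per predication. All function names and the toy inputs are hypothetical and are not taken from COS.

```python
from typing import Dict, List, Tuple

Sequence = List[str]


def corpus_to_sequences(docs: List[List[str]]) -> List[Sequence]:
    """Assumption: the MeSH terms indexing one document form one sequence."""
    # Documents with a single term carry no co-occurrence signal, so skip them.
    return [list(terms) for terms in docs if len(terms) > 1]


def ontology_to_sequences(children: Dict[str, List[str]]) -> List[Sequence]:
    """Assumption: each root-to-leaf path in the DAG forms one sequence."""
    all_children = {kid for kids in children.values() for kid in kids}
    roots = [term for term in children if term not in all_children]
    paths: List[Sequence] = []

    def walk(term: str, path: Sequence) -> None:
        path = path + [term]
        kids = children.get(term, [])
        if not kids:
            paths.append(path)  # reached a leaf, so the path is complete
        for kid in kids:
            walk(kid, path)

    for root in roots:
        walk(root, [])
    return paths


def predications_to_sequences(triples: List[Tuple[str, str, str]]) -> List[Sequence]:
    """Assumption: keep subject and object, drop the predicate label."""
    return [[subj, obj] for subj, _pred, obj in triples]


def merge_sequences(*sources: List[Sequence]) -> List[Sequence]:
    """Concatenate the per-source sequence lists into one training set."""
    return [seq for source in sources for seq in source]


# Hypothetical toy inputs, for illustration only.
docs = [["Fish Oils", "Heart Diseases"], ["Cardiovascular Diseases"]]
ontology = {
    "Cardiovascular Diseases": ["Heart Diseases", "Vascular Diseases"],
    "Heart Diseases": ["Arrhythmias, Cardiac"],
}
triples = [("Fish Oils", "TREATS", "Heart Diseases")]

merged = merge_sequences(
    corpus_to_sequences(docs),
    ontology_to_sequences(ontology),
    predications_to_sequences(triples),
)
for seq in merged:
    print(seq)
```

The merged list could then be passed to any sequence-based embedding learner; which learner COS actually uses is not stated in this section.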

Highlights

  • Neural-based approaches have shown great success in bioinformatics applications, such as drug re-purposing and Literature-Based Discovery (LBD) [1,2,3]

  • Given that heart disease is a type of cardiovascular disease, and fish oil can relieve heart disease, good embeddings of such terms can help indicate that fish oil may relieve other cardiovascular diseases and thereby advance biomedical research

  • The data sources can be categorized into three types: 1) the PubMed corpus, in which each document contains a set of Medical Subject Headings (MeSH) terms describing its content; 2) the MeSH term ontology, a directed acyclic graph (DAG) structure defined and maintained by the National Library of Medicine (NLM); 3) semantic predications extracted by SemMedDB, i.e., subject-predicate-object triples in which the subject and object are biomedical terms and the predicate is a semantic relationship
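The fish oil example in the highlights can be made concrete with a small sketch of how term embeddings might be queried. The vectors below are hand-picked, hypothetical three-dimensional values chosen only to show the mechanics; they are not learned embeddings and not results from the paper. The helper names `cosine` and `rank_neighbors` are likewise assumptions.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_neighbors(term: str, emb: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Rank every other term by cosine similarity to `term`, highest first."""
    scores = [(other, cosine(emb[term], vec)) for other, vec in emb.items() if other != term]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors, hand-picked for illustration only.
toy_embeddings = {
    "Heart Diseases": [0.9, 0.8, 0.1],
    "Cardiovascular Diseases": [0.8, 0.9, 0.2],
    "Fish Oils": [0.7, 0.6, 0.3],
    "Fractures, Bone": [0.1, 0.0, 0.9],
}

ranked = rank_neighbors("Fish Oils", toy_embeddings)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

With vectors like these, the two cardiovascular terms rank above the unrelated term, which is the kind of signal a downstream task such as LBD could use as a candidate hypothesis.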


Summary

Introduction

Neural-based approaches have shown great success in bioinformatics applications, such as drug re-purposing and Literature-Based Discovery (LBD) [1,2,3]. The data sources can be categorized into three types: 1) the PubMed corpus, in which each document contains a set of MeSH terms describing its content; 2) the MeSH term ontology, a DAG structure defined and maintained by the National Library of Medicine (NLM); 3) semantic predications extracted by SemMedDB, i.e., subject-predicate-object triples in which the subject and object are biomedical terms and the predicate is a semantic relationship. These approaches have shown that high-quality MeSH embeddings can effectively improve the performance of many downstream tasks. We propose Corpus, Ontology, and Semantic predications-based MeSH term embedding (COS), which incorporates the corpus, ontology, and semantic predications in MeSH term embedding learning; to the best of our knowledge, it is the first solution to merge all three data sources. We will make our pre-processed datasets and the COS source code publicly available.

