Abstract

Background: Representation learning provides new and powerful graph analytical approaches and tools for the highly valued data science challenge of mining knowledge graphs. Since previous graph analytical methods have mostly focused on homogeneous graphs, an important current challenge is extending this methodology to richly heterogeneous graphs and knowledge domains. The biomedical sciences are such a domain, reflecting the complexity of biology, with entities such as genes, proteins, drugs, diseases, and phenotypes, and relationships such as gene co-expression, biochemical regulation, and biomolecular inhibition or activation. Therefore, the semantics of edges and nodes are critical for representation learning and knowledge discovery in real-world biomedical problems.

Results: In this paper, we propose the edge2vec model, which represents graphs while taking edge semantics into account. An edge-type transition matrix is trained by an Expectation-Maximization approach, and a stochastic gradient descent model is employed to learn node embeddings on a heterogeneous graph via the trained transition matrix. edge2vec is validated on three biomedical domain tasks: biomedical entity classification, compound-gene bioactivity prediction, and biomedical information retrieval. Results show that by incorporating edge types into node embedding learning in heterogeneous graphs, edge2vec significantly outperforms state-of-the-art models on all three tasks.

Conclusions: We propose this method for its added value relative to existing graph analytical methodology and for its applicability to real-world biomedical knowledge discovery.

Highlights

  • Representation learning provides new and powerful graph analytical approaches and tools for the highly valued data science challenge of mining knowledge graphs

  • Previous graph embedding models were designed for homogeneous networks, which means that they do not explicitly encode information related to the types of nodes and edges in a heterogeneous network

  • We develop an Expectation-Maximization (EM) model that trains an edge-type transition matrix via random walks on a heterogeneous graph within a unified framework, and employ a stochastic gradient descent (SGD) method to learn node embeddings efficiently
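
To make the role of the edge-type transition matrix more concrete, the following is a minimal sketch of a random walk whose next step is weighted by the type of the edge just traversed. It is an illustration under stated assumptions, not the authors' implementation: the function names (`build_adjacency`, `typed_walk`), the toy graph, and the hand-set matrix `M` are all hypothetical, and the EM procedure that would actually estimate `M` is not shown.

```python
import random
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency list from (u, v, edge_type) triples."""
    adj = defaultdict(list)
    for u, v, edge_type in edges:
        adj[u].append((v, edge_type))
        adj[v].append((u, edge_type))
    return adj


def typed_walk(adj, start, length, M, rng):
    """Random walk biased by an edge-type transition matrix.

    M[i][j] is an (assumed, hand-set) weight for following an edge of
    type j immediately after an edge of type i. The first step is uniform
    because no previous edge type exists yet.
    """
    walk = [start]
    prev_type = None
    while len(walk) < length:
        neighbours = adj.get(walk[-1], [])
        if not neighbours:
            break  # dead end: stop the walk early
        if prev_type is None:
            weights = [1.0] * len(neighbours)
        else:
            weights = [M[prev_type][t] for _, t in neighbours]
        if sum(weights) <= 0:
            break  # every outgoing edge type is forbidden after prev_type
        next_node, prev_type = rng.choices(neighbours, weights=weights, k=1)[0]
        walk.append(next_node)
    return walk


# Toy heterogeneous graph (hypothetical): edge type 0 = gene co-expression,
# edge type 1 = inhibition.
edges = [
    ("geneA", "geneB", 0),
    ("geneB", "geneC", 0),
    ("geneB", "drugD", 1),
]
adj = build_adjacency(edges)

# Hand-set weights: after a type-0 edge, never follow a type-1 edge.
M = [
    [1.0, 0.0],
    [0.5, 0.5],
]

rng = random.Random(0)
walk = typed_walk(adj, "geneA", 3, M, rng)
```

In a full pipeline, many such walks would be generated from every node and passed to a skip-gram model trained with SGD to obtain node embeddings; here the matrix is fixed by hand purely to show how edge types can steer the walk.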


Summary

Introduction

Representation learning provides new and powerful graph analytical approaches and tools for the highly valued data science challenge of mining knowledge graphs. The metapath2vec approach has several drawbacks: 1) domain knowledge is required to define metapaths, and those mentioned in [7] are symmetric paths, which are unrealistic in many applications; 2) metapath2vec considers only node types, not edge types; and 3) metapath2vec can use only one metapath at a time to generate random walks, so it cannot consider all metapaths simultaneously during a random walk. On another related track, which might be termed biomedical data science (BMDS), previous work has employed KG embedding and ML methodology with a focus on applicability and applications such as compound-target bioactivity [8, 9] and disease-associated gene prioritization [10].

