Abstract

Motivation

The majority of biomedical knowledge is stored in structured databases or as unstructured text in scientific publications. This vast amount of information has led to numerous machine learning-based biological applications using either text through natural language processing (NLP) or structured data through knowledge graph embedding models. However, representations based on a single modality are inherently limited.

Results

To generate better representations of biological knowledge, we propose STonKGs, a Sophisticated Transformer trained on biomedical text and Knowledge Graphs (KGs). This multimodal Transformer uses combined input sequences of structured information from KGs and unstructured text data from biomedical literature to learn joint representations in a shared embedding space. First, we pre-trained STonKGs on a knowledge base assembled by the Integrated Network and Dynamical Reasoning Assembler consisting of millions of text-triple pairs extracted from biomedical literature by multiple NLP systems. Then, we benchmarked STonKGs against three baseline models trained on either one of the modalities (i.e. text or KG) across eight different classification tasks, each corresponding to a different biological application. Our results demonstrate that STonKGs outperforms both baselines, especially on the tasks that are more challenging with respect to the number of classes, improving upon the F1-score of the best baseline by up to 0.084 (i.e. from 0.881 to 0.965). Finally, our pre-trained model as well as the model architecture can be adapted to various other transfer learning applications.

Availability and implementation

We make the source code and the Python package of STonKGs available at GitHub (https://github.com/stonkgs/stonkgs) and PyPI (https://pypi.org/project/stonkgs/). The pre-trained STonKGs models and the task-specific classification models are available at https://huggingface.co/stonkgs/stonkgs-150k and https://zenodo.org/communities/stonkgs, respectively.

Supplementary information

Supplementary data are available at Bioinformatics online.
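The abstract describes combining a text evidence and a KG triple into a single input sequence for the multimodal Transformer. The following is an illustrative sketch of that idea only, not the actual STonKGs implementation: the function name, separator tokens, and example identifiers are hypothetical placeholders chosen to mirror BERT-style sequence-pair encoding.

```python
def combine_text_and_triple(text_tokens, triple):
    """Concatenate text tokens and a (subject, relation, object) KG triple
    into one input sequence, mirroring the idea of feeding both modalities
    into a shared embedding space. Token layout is a hypothetical sketch."""
    subj, rel, obj = triple
    return ["[CLS]"] + text_tokens + ["[SEP]"] + [subj, rel, obj] + ["[SEP]"]


# Toy example: a tokenized evidence sentence paired with an extracted triple.
sequence = combine_text_and_triple(
    ["EGFR", "activates", "MAPK", "signalling"],
    ("HGNC:3236", "increases", "HGNC:6871"),
)
print(sequence)
```

In the real model, each position of such a combined sequence would be mapped to an embedding, allowing attention layers to relate the free-text evidence to the structured triple it supports.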

Highlights

  • In recent years, the availability of biomedical data has increased drastically (Dash et al., 2019)

  • We pre-trained STonKGs on a knowledge base assembled by the Integrated Network and Dynamical Reasoning Assembler consisting of millions of text-triple pairs extracted from biomedical literature by multiple natural language processing (NLP) systems

  • The NLP baseline seemed better suited to these tasks than the knowledge graph (KG) baseline, since the relevant information could be explicitly stated in the evidence text itself


Introduction

The availability of biomedical data has increased drastically (Dash et al., 2019). Such data originate from a vast collection of modalities such as high-throughput experiments, clinical text documents, and cell-based and biochemical assay data. The information derived from research carried out on those data is commonly stored in two distinct forms: (i) as unstructured free text in scientific publications, and (ii) in condensed, structured biomedical networks. The biology described in the literature depends strongly on the context in which it occurs. To exploit the biomedical knowledge stored in both structured and unstructured formats, it is therefore crucial to study each relation in the relevant context in which it was observed.

