Text mining-based word representations for biomedical data analysis and protein-protein interaction networks in machine learning tasks.

Halima Alachram, Hryhorii Chereda, Edgar Wingender, Tim Beißbarth, Philip Stegmaier, Khanh N.Q Le

doi:10.1371/journal.pone.0258623

Open Access

https://doi.org/10.1371/journal.pone.0258623

Journal: PLOS ONE	Publication Date: Oct 15, 2021
Citations: 9	License type: CC BY 4.0

Affiliation: Universitätsmedizin Göttingen, GeneXplain (Germany)

Abstract

Biomedical and life science literature is an essential way to publish experimental results. With the rapid growth in the number of new publications, the amount of scientific knowledge represented in free text is increasing remarkably. There has been much interest in developing techniques that can extract this knowledge and make it accessible to aid scientists in discovering new relationships between biological entities and answering biological questions. Making use of the word2vec approach, we generated word vector representations based on a corpus consisting of over 16 million PubMed abstracts. We developed a text mining pipeline to produce word2vec embeddings with different properties and performed validation experiments to assess their utility for biomedical analysis. An important pre-processing step was the substitution of synonymous terms with their preferred terms in biomedical databases. Furthermore, we extracted gene-gene networks from two embedding versions and used them as prior knowledge to train Graph-Convolutional Neural Networks (Graph-CNNs) on large breast cancer gene expression data and on other cancer datasets. The performance of the resulting models was compared to that of Graph-CNNs trained with protein-protein interaction (PPI) networks or with networks derived using other word embedding algorithms. We also assessed the effect of corpus size on the variability of word representations. Finally, we created a web service with a graphical and a RESTful interface to extract and explore relations between biomedical terms using annotated embeddings. Comparisons to biological databases showed that relations between entities such as known PPIs, signaling pathways and cellular functions, or narrower disease ontology groups correlated with higher cosine similarity. Graph-CNNs trained with word2vec-embedding-derived networks performed sufficiently well on the metastatic event prediction tasks compared to other networks. This performance was good enough to validate the utility of our generated word embeddings in constructing biological networks. Word representations as produced by text mining algorithms like word2vec are therefore able to capture biologically meaningful relations between entities. Our generated embeddings are publicly available at https://github.com/genexplain/Word2vec-based-Networks/blob/main/README.md.
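
The synonym substitution step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the synonym table `PREFERRED`, the function name `normalize`, and the example sentence are assumptions made purely for illustration, and a real table would be built from biomedical databases.

```python
import re

# Hypothetical synonym table (assumption, not from the paper):
# maps a lowercase synonym to the preferred term used in its place.
PREFERRED = {
    "her2": "ERBB2",
    "neu": "ERBB2",
    "p53": "TP53",
}

def normalize(text, preferred=PREFERRED):
    """Replace each known synonym in `text` with its preferred term."""
    # Try longer synonyms first so that multi-word names win over substrings.
    keys = sorted(preferred, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        re.IGNORECASE,
    )
    # Word boundaries keep "neu" from matching inside words such as "neuron".
    return pattern.sub(lambda m: preferred[m.group(1).lower()], text)

example = normalize("HER2 and p53 are frequently altered in breast cancer.")
print(example)
```

Mapping all synonyms to one token before training means that the embedding model sees a single vocabulary entry per entity, rather than spreading the contexts of one gene across several spellings.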

Highlights

The field of Natural Language Processing (NLP) is concerned with the development of methods and algorithms to computationally analyze and process human natural language.
Comparisons showed that relations between entities such as known protein-protein interactions (PPIs), common pathways and cellular functions, or narrower disease ontology groups correlated with higher vector cosine similarity.
Our findings indicate that the performance obtained by Graph-Convolutional Neural Networks (Graph-CNNs) is sufficiently good to judge the utility of word2vec embeddings in creating gene-gene networks for machine learning tasks.

Summary

Introduction

The field of Natural Language Processing (NLP) is concerned with the development of methods and algorithms to computationally analyze and process human natural language. Solutions in this domain often have practical significance for everyday applications such as conversion between written and spoken language to enhance media accessibility, translation between languages, optical character recognition (OCR) for street sign detection in traffic assistance systems, or document/media content classification for recommendation systems. A novel approach was recently introduced that applied neural networks (NNs) to learn high-dimensional vector representations of words in a text corpus that preserve their syntactic and semantic relationships [5]. Word embeddings, as produced by word2vec, allow relations between words to be computed from a large unlabeled corpus, e.g., using their vector cosine similarity.
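
As a rough illustration of the cosine similarity mentioned above, the following Python sketch compares toy word vectors. The three-dimensional vectors and the helper names `cosine_similarity` and `most_similar` are assumptions for demonstration only; real word2vec embeddings have far more dimensions and are learned from the corpus.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings (made-up values, not taken from the published embeddings).
embeddings = {
    "BRCA1": [0.9, 0.1, 0.3],
    "BRCA2": [0.8, 0.2, 0.4],
    "INS": [0.1, 0.9, 0.2],
}

def most_similar(term, vectors):
    """Rank all other terms by cosine similarity to `term`, highest first."""
    query = vectors[term]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != term
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

for name, score in most_similar("BRCA1", embeddings):
    print(f"{name}\t{score:.3f}")
```

Because cosine similarity depends only on the angle between vectors, terms that occur in similar contexts score close to 1 regardless of vector length, which is why it is a common choice for ranking related terms in an embedding.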
