Text Mining Pipeline Research Articles

Biomedical and life science literature is an essential channel for publishing experimental results. With the rapid growth in the number of new publications, the amount of scientific knowledge represented in free text is increasing remarkably. There has been much interest in developing techniques that can extract this knowledge and make it accessible, to aid scientists in discovering new relationships between biological entities and answering biological questions. Using the word2vec approach, we generated word vector representations based on a corpus of over 16 million PubMed abstracts. We developed a text mining pipeline to produce word2vec embeddings with different properties and performed validation experiments to assess their utility for biomedical analysis. An important pre-processing step consisted of replacing synonymous terms with their preferred terms in biomedical databases. Furthermore, we extracted gene-gene networks from two embedding versions and used them as prior knowledge to train Graph-Convolutional Neural Networks (Graph-CNNs) on large breast cancer gene expression data and on other cancer datasets. The performance of the resulting models was compared to that of Graph-CNNs trained with protein-protein interaction (PPI) networks or with networks derived using other word embedding algorithms. We also assessed the effect of corpus size on the variability of word representations. Finally, we created a web service with a graphical and a RESTful interface to extract and explore relations between biomedical terms using annotated embeddings. Comparisons to biological databases showed that relations between entities such as known PPIs, signaling pathways and cellular functions, or narrower disease ontology groups correlated with higher cosine similarity. Graph-CNNs trained with word2vec-embedding-derived networks performed sufficiently well on the metastatic event prediction tasks compared to other networks. This performance was good enough to validate the utility of our generated word embeddings in constructing biological networks. Word representations produced by text mining algorithms like word2vec are therefore able to capture biologically meaningful relations between entities. Our generated embeddings are publicly available at https://github.com/genexplain/Word2vec-based-Networks/blob/main/README.md.
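The abstract above describes deriving gene-gene networks from embeddings and relating cosine similarity to known biological relations. A minimal sketch of that idea follows, assuming a plain similarity threshold is used to decide which gene pairs become edges. The toy vectors, the gene names, and the threshold value are hypothetical and are not taken from the authors' pipeline.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def build_network(embeddings, threshold):
    """Return (gene1, gene2, similarity) edges for pairs at or above threshold."""
    edges = []
    for g1, g2 in combinations(sorted(embeddings), 2):
        sim = cosine_similarity(embeddings[g1], embeddings[g2])
        if sim >= threshold:
            edges.append((g1, g2, sim))
    return edges


# Hypothetical 3-dimensional vectors; real word2vec embeddings
# typically have hundreds of dimensions.
toy_embeddings = {
    "GENE_A": [1.0, 0.0, 0.0],
    "GENE_B": [0.9, 0.1, 0.0],
    "GENE_C": [0.0, 1.0, 0.0],
}

# The threshold of 0.8 is an arbitrary illustrative choice.
edges = build_network(toy_embeddings, threshold=0.8)
for g1, g2, sim in edges:
    print(f"{g1} -- {g2} (cosine similarity {sim:.3f})")
```

In this toy example only GENE_A and GENE_B are linked, since their vectors point in nearly the same direction. How the published networks were actually thresholded or weighted is not stated in the abstract.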


Methods: We used EHR data of patients included in the Second Manifestations of ARTerial disease (SMART) study. We propose a deep learning-based multimodal architecture for our text mining pipeline that integrates neural text representation with preprocessed clinical predictors for predicting the recurrence of major cardiovascular events in cardiovascular patients. Text preprocessing, including cleaning and stemming, was first applied to filter unwanted text out of X-ray radiology reports. Thereafter, text representation methods were used to represent the unstructured radiology reports numerically as vectors. Subsequently, these text representations were added to prediction models to assess their clinical relevance. In this step, we applied logistic regression, support vector machine (SVM), multilayer perceptron neural network, convolutional neural network, long short-term memory (LSTM), and bidirectional LSTM deep neural network (BiLSTM).

Results: We performed various experiments to evaluate the added value of the text in the prediction of major cardiovascular events. The two main scenarios were the integration of radiology reports (1) with classical clinical predictors and (2) with only age and sex in the case of unavailable clinical predictors. In total, data of 5603 patients were used with 5-fold cross-validation to train the models. In the first scenario, the multimodal BiLSTM (MI-BiLSTM) model achieved an area under the curve (AUC) of 84.7%, a misclassification rate of 14.3%, and an F1 score of 83.8%. In this scenario, the SVM model, trained on clinical variables and a bag-of-words representation, achieved the lowest misclassification rate of 12.2%. In the case of unavailable clinical predictors, the MI-BiLSTM model trained on radiology reports and demographic (age and sex) variables reached an AUC, F1 score, and misclassification rate of 74.5%, 70.8%, and 20.4%, respectively.

Conclusions: Using the case study of routine care chest X-ray radiology reports, we demonstrated the clinical relevance of integrating text features and classical predictors in our text mining pipeline for cardiovascular risk prediction. The MI-BiLSTM model with word embedding representation appeared to have a desirable performance when trained on text data integrated with the clinical variables from the SMART study. Our results mined from chest X-ray reports showed that models using text data in addition to laboratory values outperform those using only known clinical predictors.
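The second scenario above, text combined with only age and sex, can be illustrated with a small sketch of the feature-integration step. It uses the bag-of-words representation that the abstract mentions for the SVM model, not the MI-BiLSTM architecture, and simply concatenates word counts with the demographic variables. The report texts, the tokenizer, and the function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(report):
    """Lowercase a report and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", report.lower())


def build_vocabulary(reports):
    """Map every token seen in the reports to a fixed column index."""
    tokens = sorted({tok for report in reports for tok in tokenize(report)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def bag_of_words(report, vocab):
    """Count how often each vocabulary token occurs in one report."""
    counts = Counter(tokenize(report))
    vector = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:  # tokens outside the vocabulary are ignored
            vector[vocab[tok]] = n
    return vector


def combine_features(report, clinical, vocab):
    """Concatenate text counts with structured predictors (e.g. age, sex)."""
    return bag_of_words(report, vocab) + list(clinical)


# Made-up report snippets, for illustration only.
reports = [
    "No acute cardiopulmonary abnormality",
    "Enlarged heart, no effusion",
]
vocab = build_vocabulary(reports)

# Hypothetical patient: age 63, sex encoded as 1.
features = combine_features(reports[1], [63.0, 1.0], vocab)
print(len(vocab), "text columns +", 2, "demographic columns")
print(features)
```

A combined vector of this kind could then be passed to any of the classifiers listed in the abstract. The stemming step and the exact encoding of clinical variables used in the study are not reproduced here.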


Articles published on Text Mining Pipeline

Bioregulatory event extraction using large language models: a case study of rice literature

Searching for LINCS to Stress: Using Text Mining to Automate Reference Chemical Curation.

Comorbidity-Guided Text Mining and Omics Pipeline to Identify Candidate Genes and Drugs for Alzheimer's Disease.

Cancer-Alterome: a literature-mined resource for regulatory events caused by genetic alterations in cancer

Advancing Italian biomedical information extraction with transformers-based models: Methodological insights and multicenter practical application

DEBBIE: The Open Access Database of Experimental Scaffolds and Biomaterials Built Using an Automated Text Mining Pipeline.

DETEXA: declarative extensible text exploration and analysis through SQL

Neuroimaging-ITM: A Text Mining Pipeline Combining Deep Adversarial Learning with Interaction Based Topic Modeling for Enabling the FAIR Neuroimaging Study.

Automated extraction of genes associated with antibiotic resistance from the biomedical literature.

ExTRI: Extraction of transcription regulation interactions from literature

Contextualizing Genes by Using Text-Mined Co-Occurrence Features for Cancer Gene Panel Discovery.

Text mining-based word representations for biomedical data analysis and protein-protein interaction networks in machine learning tasks.

COVID-19 recommender system based on an annotated multilingual corpus.

CT Angiography Clot Burden Score from Data Mining of Structured Reports for Pulmonary Embolism.

Synthetic Biology Knowledge System.

Text mining of gene–phenotype associations reveals new phenotypic profiles of autism-associated genes

Automatic Prediction of Recurrence of Major Cardiovascular Events: A Text Mining Study Using Chest X-Ray Reports.

A fast, accurate, and generalisable heuristic-based negation detection algorithm for clinical text

Comparison of rule-based and neural network models for negation detection in radiology reports

Linking chemical and disease entities to ontologies by integrating PageRank with extracted relations from literature
