A Novel Text-Mining Approach for Retrieving Pharmacogenomics Associations From the Literature.

Maria-Theodora Pandi, Peter J Van Der Spek, George P Patrinos, Maria Koromina

doi:10.3389/fphar.2020.602030

Abstract

Text mining of biomedical literature is an emerging field that has already found applications in many research areas, including genetics, personalized medicine, and pharmacogenomics. In this study, we describe a novel text-mining approach for the extraction of pharmacogenomics associations. The code was implemented in the R programming language, using custom scripts where needed and functions from existing libraries otherwise. Articles (abstracts or full texts) matching a specified query were retrieved from PubMed, while concept annotations were obtained from PubTator Central. Terms denoting a Mutation or a Gene, as well as Chemical terms corresponding to drug compounds, were normalized, and the sentences containing these terms were filtered and preprocessed to create appropriate training sets. Finally, after training and hyperparameter tuning, four text classifiers were created and evaluated (FastText, linear-kernel SVMs, XGBoost, and Lasso and Elastic-Net Regularized Generalized Linear Models) with regard to their performance in identifying pharmacogenomics associations. Although further improvements are essential before this text-mining approach can be properly implemented in clinical practice, our study stands as a comprehensive, simplified, and up-to-date approach for the identification and assessment of research articles enriched in clinically relevant pharmacogenomics relationships. Furthermore, this work highlights a series of challenges concerning the effective application of text mining to biomedical literature, whose resolution could substantially contribute to the further development of this field.

Highlights

Over the span of 10 years, technological achievements and advances have shifted the direction of pharmacogenomics (PGx) research from candidate gene PGx to large-scale PGx studies (Giacomini et al., 2017; Lavertu et al., 2018).
The identification of biomedical entities of interest via PubTator Central resulted in 5,307 papers with unique PMIDs (2,257, or 42.5%, of which are available as full texts).
The best hyperparameters for FastText were determined by testing different values for specific parameters while closely following the provided guidelines.

Summary

Introduction

Over the span of 10 years, technological achievements and advances have shifted the direction of pharmacogenomics (PGx) research from candidate gene PGx to large-scale PGx studies (Giacomini et al., 2017; Lavertu et al., 2018). [...] research and, in most cases, manual curation of thousands of articles. To this end, biomedical text mining can prove useful in reducing the manual effort of curating important PGx relationships from the available literature. Text mining has become a widely used approach for the identification and extraction of information from unstructured text (Westergaard et al., 2018). In biomedical text mining, PubMed is the primary resource used for this purpose, owing to the easy and fast extraction of information regarding biological entities, such as genes and proteins.

Methods
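
The abstract describes a step in which sentences containing Gene or Mutation terms together with Chemical (drug) terms are normalized and filtered to build training sets. A minimal sketch of that idea follows. It is written in Python for illustration only and is not the authors' R code. The `Annotation` record, the placeholder tokens (`GENE`, `MUTATION`, `DRUG`), the co-occurrence rule, and the function names are all assumptions, and annotations are assumed not to overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record for one concept mention inside a sentence."""
    start: int  # character offset where the mention begins
    end: int    # character offset just past the mention
    kind: str   # concept type, e.g. "Gene", "Mutation", "Chemical"


# Assumed placeholder tokens; the paper's normalization scheme may differ.
PLACEHOLDERS = {"Gene": "GENE", "Mutation": "MUTATION", "Chemical": "DRUG"}


def is_candidate(annotations):
    """True if a drug mention co-occurs with a gene or mutation mention."""
    kinds = {a.kind for a in annotations}
    return "Chemical" in kinds and bool(kinds & {"Gene", "Mutation"})


def normalize(sentence, annotations):
    """Replace each annotated mention with the placeholder for its type."""
    out = sentence
    # Work from right to left so that earlier offsets stay valid.
    for a in sorted(annotations, key=lambda a: a.start, reverse=True):
        token = PLACEHOLDERS.get(a.kind)
        if token is not None:
            out = out[:a.start] + token + out[a.end:]
    return out


def build_candidates(annotated_sentences):
    """Keep and normalize only the sentences that pass the co-occurrence rule."""
    return [normalize(s, anns) for s, anns in annotated_sentences if is_candidate(anns)]


# Toy input: (sentence, annotations) pairs with hand-written offsets.
examples = [
    ("CYP2D6 variants alter codeine metabolism.",
     [Annotation(0, 6, "Gene"), Annotation(22, 29, "Chemical")]),
    ("Codeine is an opioid analgesic.",
     [Annotation(0, 7, "Chemical")]),
]
print(build_candidates(examples))
```

Only the first toy sentence is kept, because the second mentions a drug without a gene or mutation. Replacing mentions with type placeholders is one common way to keep a classifier from memorizing specific gene or drug names; whether the authors normalized terms this way is not stated in the text above.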

Results

Conclusion