Transformers-sklearn: a toolkit for medical language understanding with transformer-based models

Feihong Yang, Jiao Li, Xuwen Wang, Hetong Ma

doi:10.1186/s12911-021-01459-0

Abstract

Background: Transformer is an attention-based architecture that has proven to be the state-of-the-art model in natural language processing (NLP). To lower the barrier to using transformer-based models for medical language understanding and to extend the deep learning capability of the scikit-learn toolkit, we propose an easy-to-learn Python toolkit named transformers-sklearn. By wrapping the interfaces of transformers in only three functions (i.e., fit, score, and predict), transformers-sklearn combines the advantages of the transformers and scikit-learn toolkits.

Methods: In transformers-sklearn, three Python classes were implemented: BERTologyClassifier for the classification task, BERTologyNERClassifier for the named entity recognition (NER) task, and BERTologyRegressor for the regression task. Each class contains three methods: fit for fine-tuning transformer-based models with the training dataset, score for evaluating the performance of the fine-tuned model, and predict for predicting the labels of the test dataset. transformers-sklearn is a user-friendly toolkit that (1) is customizable via a few parameters (e.g., model_name_or_path and model_type), (2) supports multilingual NLP tasks, and (3) requires less coding. The toolkit automatically generates the model input format from the annotated corpus, so newcomers only need to prepare the dataset. The model framework and training methods are predefined in transformers-sklearn.

Results: We collected four open-source medical language datasets: TrialClassification for Chinese medical trial text multilabel classification, BC5CDR for English biomedical text named entity recognition, DiabetesNER for Chinese diabetes entity recognition, and BIOSSES for English biomedical sentence similarity estimation. Across the four medical NLP tasks, the average code size of our scripts is 45 lines per task, which is one-sixth the size of the corresponding transformers scripts. The experimental results show that transformers-sklearn based on pretrained BERT models achieved macro F1 scores of 0.8225, 0.8703 and 0.6908 on the TrialClassification, BC5CDR and DiabetesNER tasks, respectively, and a Pearson correlation of 0.8260 on the BIOSSES task, which is consistent with the results of transformers.

Conclusions: The proposed toolkit could help newcomers address medical language understanding tasks easily using the scikit-learn coding style. The code and tutorials of transformers-sklearn are available at https://doi.org/10.5281/zenodo.4453803. In the future, more medical language understanding tasks will be supported to broaden the applications of transformers-sklearn.
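The three-method interface described in the abstract can be illustrated with a minimal sketch. The class below, ToyTextClassifier, is a hypothetical stand-in written for this illustration only; it is not part of transformers-sklearn and does no fine-tuning. It replaces the transformer with simple per-label word counts so that the fit, predict and score call pattern can be run without any dependencies. Returning plain accuracy from score is also an assumption of this sketch; the abstract only states that score evaluates the fine-tuned model. The sample texts and labels are invented.

```python
from collections import Counter, defaultdict


class ToyTextClassifier:
    """Hypothetical stand-in that mirrors a fit/score/predict interface.

    NOT the transformers-sklearn implementation: it scores each label by
    word overlap with the training texts instead of fine-tuning a model.
    """

    def __init__(self, lowercase=True):
        self.lowercase = lowercase
        self.word_counts_ = defaultdict(Counter)  # label -> word frequencies
        self.default_label_ = None

    def _tokenize(self, text):
        text = text.lower() if self.lowercase else text
        return text.split()

    def fit(self, X, y):
        """Count words per label (this step stands in for fine-tuning)."""
        self.word_counts_ = defaultdict(Counter)
        for text, label in zip(X, y):
            self.word_counts_[label].update(self._tokenize(text))
        # Fallback for texts that share no word with the training data.
        self.default_label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        """Return the label whose training words overlap most with each text."""
        predictions = []
        for text in X:
            tokens = self._tokenize(text)
            scores = {
                label: sum(counts[token] for token in tokens)
                for label, counts in self.word_counts_.items()
            }
            best = max(scores, key=scores.get)
            predictions.append(best if scores[best] > 0 else self.default_label_)
        return predictions

    def score(self, X, y):
        """Return accuracy on (X, y); an assumption made for this sketch."""
        predictions = self.predict(X)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)


# Invented toy data, loosely styled after trial eligibility criteria.
train_texts = [
    "patients aged 18 to 65",
    "history of severe diabetes",
    "aged over 60 years",
    "diagnosed with type 2 diabetes",
]
train_labels = ["age", "disease", "age", "disease"]
test_texts = ["aged under 18", "gestational diabetes"]
test_labels = ["age", "disease"]

clf = ToyTextClassifier().fit(train_texts, train_labels)
predictions = clf.predict(test_texts)
accuracy = clf.score(test_texts, test_labels)
print(predictions, accuracy)
```

The point of the pattern is that the caller only prepares texts and labels; everything model-specific sits behind the three methods, which is what lets a transformer-backed class be used like any scikit-learn estimator.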

Highlights

Transformer is an attention-based architecture that has proven to be the state-of-the-art model in natural language processing (NLP)
The transformers-sklearn toolkit achieved macro F1 scores of 0.8225, 0.8703 and 0.6908 in the TrialClassification, BC5CDR and DiabetesNER tasks, respectively, and a Pearson correlation of 0.8260 in the BIOSSES task, which are consistent with the results of transformers
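For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how a macro F1 score and a Pearson correlation can be computed. It uses the standard textbook definitions in plain Python; the source does not state how the toolkit computes them (for example, whether NER F1 is measured per token or per entity), so the label-level computation here is an assumption, and the sample data are invented.

```python
import math


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        f1_scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(f1_scores) / len(f1_scores)


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


# Invented examples: label "a" has F1 = 2/3 and label "b" has F1 = 0.8.
example_f1 = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
# Perfectly linear scores give a correlation of 1 (up to rounding).
example_r = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(round(example_f1, 4), round(example_r, 4))
```

Macro averaging weights every label equally regardless of frequency, which is why it is a common choice for imbalanced label sets; Pearson correlation suits BIOSSES because its target is a continuous similarity score rather than a class.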

Summary

Introduction

Transformer is an attention-based architecture that has proven to be the state-of-the-art model in natural language processing (NLP). As NLP technology has developed, many transformer-based models have emerged. To use these models effectively and evaluate their performance in downstream tasks, a Python library of transformer-based models named transformers [4] was developed; it gathers state-of-the-art general-purpose pretrained models under a unified application programming interface (API), together with an ecosystem of libraries. De Vazelhes et al. implemented supervised and weakly supervised distance metric learning algorithms and wrapped them in a Python package named metric-learn [11]. Such works have made scikit-learn more powerful and efficient in domain-specific tasks.

Journal: BMC Medical Informatics and Decision Making
Publication Date: Jul 1, 2021
Citations: 20
License type: open-access
