Abstract
Randomized controlled trials (RCTs) play a major role in biomedical research and practice. To inform this research, the demand for highly accurate retrieval of scientific articles on RCT research has grown in recent decades. However, correctly identifying all published RCTs in a given domain is a non-trivial task, which has motivated computer scientists to develop methods for identifying papers involving RCTs. Although existing studies have provided invaluable insights into how RCT tags can be predicted for biomedical research articles, they used datasets from different sources, of varying sizes and timeframes, so their models and findings cannot be compared across studies. In addition, because datasets and code are rarely shared, researchers conducting RCT classification have to write code from scratch, reinventing the wheel. In this paper, we present Bat4RCT, a suite of data and an integrated method to serve as a strong baseline for RCT classification, which includes the use of BERT-based models in comparison with conventional machine learning techniques. To validate our approach, all models are applied to 500,000 paper records in MEDLINE. The BERT-based models showed consistently higher recall scores than the conventional machine learning and CNN models while producing slightly better or similar precision scores. The best performance was achieved by the BioBERT model trained on both title and abstract texts, with an F1 score of 90.85%. This infrastructure of dataset and code will provide a competitive baseline for the evaluation and comparison of new methods and the convenience of future benchmarking. To the best of our knowledge, our study is the first to apply BERT-based language modeling techniques to RCT classification tasks and to share its dataset and code in order to promote reproducibility and improvement in text classification for biomedical research.
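As a minimal sketch of the kind of conventional machine-learning baseline the abstract compares against, the snippet below trains a TF-IDF plus logistic-regression classifier on concatenated title-and-abstract text. The toy records, labels, and feature settings here are illustrative assumptions, not the paper's actual MEDLINE data or pipeline.

```python
# Hedged sketch: a conventional ML baseline for binary RCT classification
# (label 1 = RCT, 0 = non-RCT). Inputs concatenate title + abstract,
# mirroring the best-performing input configuration reported above.
# The four example records are invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

records = [
    ("Randomized controlled trial of drug X versus placebo. "
     "Patients were randomly assigned to treatment arms.", 1),
    ("A double-blind randomized trial evaluating vaccine Y. "
     "Participants were randomized 1:1 to vaccine or placebo.", 1),
    ("A retrospective cohort study of disease Z outcomes. "
     "Charts were reviewed for a single-center cohort.", 0),
    ("Case report: an unusual presentation of condition W. "
     "We describe a single patient's clinical course.", 0),
]
texts, labels = zip(*records)

# Unigram + bigram TF-IDF features feeding a logistic-regression classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

preds = clf.predict([
    "Patients were randomly assigned in this randomized controlled trial.",
    "A retrospective chart review of a hospital cohort.",
])
print(list(preds))
```

A BERT-based model such as BioBERT would replace the TF-IDF features with contextual embeddings fine-tuned on the same title-plus-abstract inputs; the abstract reports that this yields consistently higher recall at similar precision.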