Abstract

Machine learning (ML) systems are widely used for automatic entity recognition in pharmacovigilance. Publicly available datasets, however, do not allow annotated entities to be used independently, as they focus on small entity subsets or on a single language register (informal or scientific language). The objectives of the current study were to create a dataset that enables independent use of entities, to explore the performance of predictive ML models across registers, and to introduce a method for investigating entity cut-off performance. A dataset was created that combines different registers and covers 18 entity types. We used this dataset to compare the performance of integrated models with that of models trained on a single language register only. We introduced fractional stratified k-fold cross-validation, which determines model performance at the entity level from fractions of the training dataset, and used it to trace the course of entity performance across training-set fractions and to evaluate entity peak and cut-off performance. The dataset comprises 1400 records (scientific language: 790; informal language: 610) with 2622 sentences and 9989 entity occurrences, drawn from external (801 records) and internal (599 records) sources. We demonstrated that single-register models underperform compared with integrated models trained on multiple language registers. A manually annotated dataset with a variety of pharmaceutical and biomedical entities was created and is made available to the research community. Our results show that models combining different registers offer better maintainability, higher robustness, and similar or higher performance. Fractional stratified k-fold cross-validation allows the evaluation of training data sufficiency at the entity level.
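To make the evaluation scheme concrete, the following is a minimal sketch of fractional stratified k-fold cross-validation, assuming a scikit-learn-style classifier on synthetic data. The names (`run_fractional_cv`, `FRACTIONS`) are illustrative assumptions rather than the paper's implementation, and per-class F1 stands in for per-entity recognition performance.

```python
# Minimal sketch of fractional stratified k-fold cross-validation:
# within each fold, train on growing fractions of the training split
# and record per-class F1, yielding a learning curve per class.
# All names here are illustrative, not taken from the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

FRACTIONS = (0.1, 0.25, 0.5, 0.75, 1.0)  # training-set fractions to probe


def run_fractional_cv(X, y, n_splits=5, seed=42):
    """Return, for each fraction, the per-class F1 averaged over folds."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    rng = np.random.default_rng(seed)
    curves = {frac: [] for frac in FRACTIONS}
    for train_idx, test_idx in skf.split(X, y):
        for frac in FRACTIONS:
            # Subsample the training fold; stratification within the
            # subsample is only approximated by random choice here.
            n = max(2, int(frac * len(train_idx)))
            sub = rng.choice(train_idx, size=n, replace=False)
            model = LogisticRegression(max_iter=1000).fit(X[sub], y[sub])
            # average=None gives one F1 score per class (per "entity").
            f1 = f1_score(y[test_idx], model.predict(X[test_idx]), average=None)
            curves[frac].append(f1)
    return {frac: np.mean(scores, axis=0) for frac, scores in curves.items()}


if __name__ == "__main__":
    X, y = make_classification(n_samples=600, n_classes=3,
                               n_informative=6, random_state=0)
    for frac, per_class_f1 in run_fractional_cv(X, y).items():
        print(f"fraction={frac:.2f}  per-class F1={np.round(per_class_f1, 3)}")
```

Plotting each class's F1 against the training fraction exposes where performance plateaus (peak) and below which fraction it degrades (cut-off), which is the intended reading of the method described above.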
