Abstract

Machine learning has emerged as a cost-effective innovation to support systematic literature reviews in human health risk assessments and other contexts. Supervised machine learning approaches rely on a training dataset, a relatively small set of documents with human-annotated labels indicating their topic, to build models that automatically classify a larger set of unclassified documents. “Active” machine learning has been proposed as an approach that limits the cost of creating a training dataset by interactively and sequentially focussing on training only the most informative documents. We simulate active learning using a dataset of approximately 7000 abstracts from the scientific literature related to the chemical arsenic. The dataset was previously annotated by subject matter experts with regard to relevance to two topics relating to toxicology and risk assessment. We examine the performance of alternative sampling approaches to sequentially expanding the training dataset, specifically looking at uncertainty-based sampling and probability-based sampling. We discover that while such active learning methods can potentially reduce training dataset size compared to random sampling, predictions of model performance in active learning are likely to suffer from statistical bias that negates the method’s potential benefits. We discuss approaches and the extent to which the bias resulting from skewed sampling can be compensated. We propose a useful role for active learning in contexts in which the accuracy of model performance metrics is not critical and/or where it is beneficial to rapidly create a class-balanced training dataset.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.