Abstract

Machine learning-based text classification models require labeled data for training. However, manual labeling is a costly and time-consuming process. The task is particularly difficult in domains such as banking, where outsourcing data labeling is generally not allowed due to privacy laws. We propose a novel active learning-based approach in which the most difficult instances in the pool of unlabeled data are selected based on the Shapley Additive Explanations (SHAP) values of the words in the texts to be classified and are then passed to human annotators for labeling. At each iteration of this human-in-the-loop strategy, the newly labeled instances are added to the training set. We demonstrate the effectiveness of this approach in classifying customer comments from surveys in the banking domain. Our experiments indicate that expanding the training set with the proposed approach yields better results than a baseline strategy of expanding it with randomly selected instances. Further analysis shows that the difference in performance between the two approaches becomes more pronounced as class imbalance increases. This study suggests that human-in-the-loop active learning is a powerful strategy for creating high-quality training datasets by effectively leveraging human annotation effort.
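
The selection loop described above can be illustrated with a minimal sketch. The abstract does not state which classifier is used or how word-level SHAP values are turned into a difficulty score, so everything specific here is an assumption made for illustration: a small bag-of-words logistic regression, the closed-form SHAP values of a linear model under feature independence (weight times the feature's deviation from its mean, in log-odds space), and a hypothetical `difficulty` score that is high when positive and negative word contributions nearly cancel. The function names, the `annotate` callback standing in for the human annotator, and the toy comments are likewise hypothetical and are not data from the study.

```python
import math

def tokenize(text):
    return text.lower().split()

def featurize(text, vocab):
    # Binary bag-of-words vector over a fixed vocabulary.
    tokens = set(tokenize(text))
    return [1.0 if word in tokens else 0.0 for word in vocab]

def train_logreg(X, y, epochs=100, lr=0.1):
    # Plain stochastic gradient descent for logistic regression.
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            grad = 1.0 / (1.0 + math.exp(-z)) - yi
            w = [wj - lr * grad * xj for wj, xj in zip(w, xi)]
            b -= lr * grad
    return w, b

def linear_shap(w, x, mean_x):
    # SHAP values of a linear model with independent features,
    # expressed in log-odds space: phi_j = w_j * (x_j - E[x_j]).
    return [wj * (xj - mj) for wj, xj, mj in zip(w, x, mean_x)]

def difficulty(shap_values):
    # ASSUMED score, not taken from the paper: 1.0 when positive and
    # negative evidence cancel (or no evidence exists), 0.0 when all
    # contributions point the same way.
    pos = sum(v for v in shap_values if v > 0)
    neg = -sum(v for v in shap_values if v < 0)
    total = pos + neg
    return 1.0 if total == 0 else 1.0 - abs(pos - neg) / total

def active_learning_round(labeled, pool, annotate, k):
    # One human-in-the-loop iteration: train, score the pool, send the
    # k most difficult texts to the annotator, grow the training set.
    vocab = sorted({tok for text, _ in labeled for tok in tokenize(text)})
    X = [featurize(text, vocab) for text, _ in labeled]
    y = [label for _, label in labeled]
    w, _bias = train_logreg(X, y)
    mean_x = [sum(col) / len(X) for col in zip(*X)]
    scores = [difficulty(linear_shap(w, featurize(t, vocab), mean_x))
              for t in pool]
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    picked = [pool[i] for i in sorted(chosen)]
    new_labeled = labeled + [(t, annotate(t)) for t in picked]
    new_pool = [t for i, t in enumerate(pool) if i not in chosen]
    return new_labeled, new_pool, picked

# Made-up toy comments (1 = positive, 0 = negative), for illustration only.
labeled = [("great service and friendly staff", 1),
           ("long wait and rude staff", 0),
           ("great mobile app", 1),
           ("rude and slow support", 0)]
pool = ["friendly staff but long wait", "great app",
        "slow and rude", "card arrived today"]
gold = {"friendly staff but long wait": 0, "great app": 1,
        "slow and rude": 0, "card arrived today": 1}

# gold.__getitem__ stands in for the human annotator.
new_labeled, new_pool, picked = active_learning_round(
    labeled, pool, gold.__getitem__, k=2)
```

The random baseline mentioned above would replace the ranking step with a uniform random sample of `k` pool indices, leaving the rest of the loop unchanged. With a non-linear classifier, the closed-form `linear_shap` would have to be replaced by a model-appropriate SHAP explainer.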
