Labeling News Article’s Subject Using Uncertainty Based Active Learning

Meet Parekh,Yash Patel

doi:10.1007/978-3-030-76063-2_15

Abstract

In Natural Language Processing, labeling a text corpus is often an expensive task that requires a lot of human efforts and cost. Whereas unlabeled text corpora in varying domains are readily available. For a couple of decades, research efforts have concentrated on algorithms that can be used for labeling the corpus, thus minimizing the number of articles required to be labeled manually. Semi-Supervised Learning and Active Learning have been a great promise for labeling the articles using a trained model. Also, Semi-Supervised learning algorithms and Active learning algorithms have strong theoretical guarantees. This study aims to tag 1183 articles from The New York Times and The Wall Street Journal with the subject (i.e. primary organization related to news articles) employing Active Learning algorithm. We used Active Learning algorithm which uses Random Sampling along with Uncertainty Based Querying. This Active Learning approach is used to train Naive Bayes classifier using Bag of Words features. This classifier is used to tag 1183 articles of which only 167 required manual review, thus achieving reduction of 85.89% with 78.18% accuracy. Also, for verifying quality of labeled corpus, SVM classifier using same features was trained on labeled corpus giving accuracy of 74.45% on test data.

Full Text