Abstract

Active learning is a class of machine learning algorithms that autonomously select the data samples from which they will learn. It has been widely used in many data mining fields, such as text classification, in which large amounts of unlabelled data samples are available but labels are hard to obtain. In this paper, an improved active learning algorithm is proposed that takes advantage of the distribution of the datasets to reduce the labelling cost and increase accuracy. Before the active learning process, a spectral clustering algorithm is applied to divide the dataset into two categories, and instances located at the boundary between the two categories are labelled to train the initial classifier. To reduce the computational cost, an incremental method is incorporated into the proposed algorithm. The algorithm is applied to several text classification problems, and the results show that it is more effective and more accurate than the traditional active learning algorithm.
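The initialisation step can be illustrated with a short sketch, assuming scikit-learn's SpectralClustering and a dense feature matrix. The distance-to-centroid rule used below to identify "boundary" instances is an assumption on the reader's part, since the abstract does not state the exact boundary criterion, and the function name and parameters are illustrative rather than taken from the paper.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def initial_query_indices(X, n_init=10, random_state=0):
    """Cluster X into two categories and return the indices of the points
    lying closest to the border between them (candidates for initial labelling)."""
    labels = SpectralClustering(n_clusters=2,
                                random_state=random_state).fit_predict(X)
    # Centroid of each of the two clusters.
    centroids = np.vstack([X[labels == c].mean(axis=0) for c in (0, 1)])
    # Distance of every point to both centroids.
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # A small gap between the two distances means the point sits near the
    # border of the two clusters (assumed proxy for "boundary" instances).
    border_score = np.abs(dist[:, 0] - dist[:, 1])
    return np.argsort(border_score)[:n_init]
```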

Highlights

  • In the last few years, active learning [1] has become increasingly popular because of its effectiveness, especially for learning tasks where the class label of each data sample is difficult to obtain while unlabeled data are plentiful or easy to collect

  • By applying active learning algorithms, the most informative samples are selected in order to learn the correct classifier with fewer labeled data samples

  • Before the start of active learning, the whole dataset is clustered into two categories, and the instances located on the border of the two categories are picked as the initial support vectors; during the learning process, the points closest to the hyperplane are chosen as new instances of the training set


Summary

Introduction

In the last few years, active learning [1] has become increasingly popular because of its effectiveness, especially for learning tasks where the class label of each data sample is difficult to obtain while unlabeled data are plentiful or easy to collect. If the current hyperplane lies far away from the optimal one, the instances selected according to the current model will be of little use for updating the model and finding the correct hyperplane. This cost is due to ignoring the distribution of the training data. Before the start of active learning, the whole dataset is clustered into two categories, and the instances located on the border of the two categories are picked as the initial support vectors; during the learning process, the points closest to the hyperplane are chosen as new instances of the training set. The effect of this algorithm is shown in the results of applying it to several text classification problems.
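As a rough illustration of the selection rule described above, the sketch below wraps a margin-based query loop around scikit-learn's linear SVC: the seed set comes from the clustering step, and in each round the unlabeled points closest to the current hyperplane are queried. The incremental update mentioned in the abstract is replaced here by full retraining for brevity, so this is a minimal sketch rather than the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC

def margin_based_active_learning(X, oracle_labels, init_idx,
                                 n_rounds=20, batch_size=5):
    """Start from the clustering-based seed set `init_idx`, then repeatedly
    query the unlabeled points closest to the current hyperplane."""
    labeled = set(int(i) for i in init_idx)        # seed set must contain both classes
    clf = SVC(kernel="linear")
    for _ in range(n_rounds):
        idx = np.fromiter(labeled, dtype=int)
        clf.fit(X[idx], oracle_labels[idx])        # full retrain (no incremental update)
        unlabeled = np.array([i for i in range(len(X)) if i not in labeled])
        if unlabeled.size == 0:
            break
        # |decision_function| measures how close a point lies to the hyperplane;
        # the smallest values correspond to the most informative samples.
        margin = np.abs(clf.decision_function(X[unlabeled]))
        query = unlabeled[np.argsort(margin)[:batch_size]]
        labeled.update(int(i) for i in query)      # "ask the oracle" for these labels
    return clf
```

In practice the queried labels would come from a human annotator; here `oracle_labels` simply stands in for that oracle so the loop is runnable end to end.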

Active learning based on spectral clustering
Support vector machine
Application to text classification
Findings
Conclusion
