Abstract

One of the major factors affecting the performance of classification algorithms is the amount of labeled data available during the training phase. It is widely accepted that labeling vast amounts of data is both expensive and time-consuming, since it requires human expertise. In a wide variety of scientific fields, unlabeled examples are easy to collect but hard to exploit in a way that enriches the information contained in a dataset. In this context, a variety of learning methods have been studied in the literature that aim to efficiently utilize vast amounts of unlabeled data during the learning process. The most common approaches tackle problems of this kind by applying active learning or semi-supervised learning methods individually. In this work, a combination of active learning and semi-supervised learning methods is proposed under a common self-training scheme, in order to efficiently utilize the available unlabeled data. Two effective and robust metrics, the entropy and the distribution of the predicted probabilities over the unlabeled set, are used to select the most suitable unlabeled examples for augmenting the initial labeled set. The superiority of the proposed scheme is validated by comparing it against the baseline approaches of supervised, semi-supervised, and active learning on a wide range of fifty-five benchmark datasets.
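The paper itself gives no code, but the entropy metric mentioned above can be illustrated with a minimal sketch. It assumes a scikit-learn-style classifier exposing predict_proba; the model choice, the toy data, and the selection sizes are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch of entropy-based example selection, assuming a
# scikit-learn-style classifier with predict_proba; the model, data,
# and selection sizes are illustrative, not taken from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression

def prediction_entropy(probs: np.ndarray) -> np.ndarray:
    """Shannon entropy of each row of class-probability estimates."""
    eps = 1e-12  # guard against log(0)
    return -np.sum(probs * np.log(probs + eps), axis=1)

# Toy data: a small labeled pool and a larger unlabeled pool.
rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(20, 5))
y_labeled = rng.integers(0, 2, size=20)
X_unlabeled = rng.normal(size=(200, 5))

clf = LogisticRegression().fit(X_labeled, y_labeled)
scores = prediction_entropy(clf.predict_proba(X_unlabeled))

# Low entropy -> confident predictions (candidates for pseudo-labeling);
# high entropy -> uncertain predictions (candidates for oracle queries).
confident_idx = np.argsort(scores)[:10]
uncertain_idx = np.argsort(scores)[-10:]
```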

Highlights

  • The most common approach established in machine learning (ML) is supervised learning (SL). Under SL schemes, classifiers are trained using purely labeled data

  • Fifty-five (55) benchmark datasets were extracted from the UCI repository [14], covering a wide range of classification problems

  • The k parameter was set equal to ten, as is commonly done in the majority of the literature

Introduction

Under supervised learning (SL) schemes, classifiers are trained using purely labeled data. Along with the problem complexity, the performance of such schemes is directly related to the amount and the quality of the labeled data used during the training phase. Many research works [7] focus on techniques that exploit the available unlabeled data, especially in favor of classification problems. The most common learning methods incorporating such techniques are active learning (AL) and semi-supervised learning (SSL) [8]. Both AL and SSL share an iterative learning nature, making them a natural fit for constructing more complex combined learning schemes.
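As a hedged illustration of how such a combination could operate within a single self-training loop (a sketch of the general idea, not the paper's exact algorithm), the snippet below pseudo-labels the most confident unlabeled examples (the SSL step) and queries an oracle for the most uncertain ones (the AL step). The oracle callback, query budget, confidence threshold tau, and number of rounds are all hypothetical choices.

```python
# A hedged sketch of one way AL and SSL could be combined under a single
# self-training loop; the oracle callback, query budget, confidence
# threshold tau, and number of rounds are hypothetical choices.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def prediction_entropy(probs):
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def self_training_al_ssl(X_l, y_l, X_u, oracle, rounds=5, budget=5, tau=0.1):
    clf = RandomForestClassifier(random_state=0).fit(X_l, y_l)
    for _ in range(rounds):
        if len(X_u) == 0:
            break
        probs = clf.predict_proba(X_u)
        h = prediction_entropy(probs)
        # SSL step: self-label the most confident unlabeled examples.
        ssl_idx = np.where(h < tau)[0]
        # AL step: ask the oracle to label the most uncertain examples.
        al_idx = np.setdiff1d(np.argsort(h)[-budget:], ssl_idx)
        pseudo_y = clf.classes_[probs[ssl_idx].argmax(axis=1)]
        queried_y = oracle(X_u[al_idx])
        # Augment the labeled set and shrink the unlabeled pool.
        taken = np.concatenate([ssl_idx, al_idx])
        X_l = np.vstack([X_l, X_u[taken]])
        y_l = np.concatenate([y_l, pseudo_y, queried_y])
        X_u = np.delete(X_u, taken, axis=0)
        clf = RandomForestClassifier(random_state=0).fit(X_l, y_l)
    return clf
```

In practice the oracle would wrap a human annotator; in benchmark experiments it can simply return the held-out true labels of the queried rows.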
