Abstract
As datasets have continuously grown, efforts have been made to address the large amount of unlabeled data relative to the scarcity of labeled data. Another important issue is the trade-off between the difficulty of obtaining annotations from a specialist and the need for a significant amount of annotated data to obtain a robust classifier. In this context, combining active learning with semi-supervised learning is of interest: a small number of more informative samples, selected by the active learning strategy and labeled by a specialist, can propagate their labels to a set of unlabeled data through the semi-supervised strategy. However, most works in the literature neglect the interactive response times that certain real applications require. We propose a more effective and efficient active semi-supervised learning framework, including a new active learning method. An extensive experimental evaluation was performed in the biological context (using the ALL-AML, Escherichia coli and PlantLeaves II datasets), comparing our proposals with state-of-the-art works and with different supervised (SVM, RF, OPF) and semi-supervised (YATSI-SVM, YATSI-RF and YATSI-OPF) classifiers. The results show the benefits of our framework, which allows the classifier to reach higher accuracies more quickly with a reduced number of annotated samples. Moreover, the selection criterion adopted by our active learning method, based on diversity and uncertainty, prioritizes the most informative boundary samples for the learning process. We obtained a gain of up to 20% over other learning techniques. The active semi-supervised learning approaches presented a better trade-off between accuracy and competitive, viable computational times than the active supervised learning ones.
Highlights
The amount of information available has been increasing, due to new means of acquisition, increased storage capacity and speed of communication, producing large datasets
This paper proposes a more effective and efficient learning approach to cope with: i) a high proportion of unlabeled data; ii) scarcity of labeled data; iii) the need for a significant amount of specialist-labeled data for the classifiers to reach high accuracies; iv) the difficulty of obtaining annotations from a specialist; v) the need for interactive response times in the learning process
We evaluate the performance of the classifiers under active learning, comparing the selection strategies described in Section 1.2 (Rand, Cluster Rand (Clu), Increasing Boundary Edges (IBE), Root Distance-based Sampling (RDS)) with our proposed Root Distance Boundary Sampling (RDBS) selection strategy
Summary
The amount of information available has been increasing, due to new means of acquisition, increased storage capacity and speed of communication, producing large datasets. By combining active learning (AL) and semi-supervised learning (SSL) techniques, it is possible to select the most significant samples from the dataset. These samples compose the labeled training set, and their labels are propagated to the unlabeled training set, yielding a more robust classifier. A smaller number of more informative samples, selected by our active learning strategy and labeled by the specialist, can propagate labels to a set of unlabeled data more effectively (i.e. with fewer errors) through the semi-supervised strategy. The specialist therefore does not need to spend time and effort labeling a large dataset
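The select, annotate and propagate loop described above can be sketched in a few lines. This is a minimal illustration and not the paper's method: the margin-based uncertainty score stands in for RDBS, 1-nearest-neighbour pseudo-labeling stands in for the YATSI-style semi-supervised step, and every name (`active_semi_supervised`, `margin`, `nearest_label`, `oracle`) and the toy data are assumptions made for the example.

```python
import math


def nearest_label(x, known):
    """1-NN: return the label of the labeled point closest to x."""
    return min(known, key=lambda item: math.dist(x, item[0]))[1]


def margin(x, known):
    """Uncertainty score (assumed stand-in, not RDBS): gap between the
    distances to the two closest classes. Small gap = near a boundary."""
    best = {}
    for point, label in known:
        d = math.dist(x, point)
        best[label] = min(d, best.get(label, math.inf))
    dists = sorted(best.values())
    return dists[1] - dists[0] if len(dists) > 1 else 0.0


def active_semi_supervised(points, oracle, seed_idx, rounds, per_round):
    """Query the oracle (the specialist) for the most uncertain samples,
    then propagate labels to everything that was never annotated."""
    labeled = {i: oracle(i) for i in seed_idx}
    for _ in range(rounds):
        pool = [i for i in range(len(points)) if i not in labeled]
        if not pool:
            break
        known = [(points[i], y) for i, y in labeled.items()]
        # Active learning step: lowest margin first.
        pool.sort(key=lambda i: margin(points[i], known))
        for i in pool[:per_round]:
            labeled[i] = oracle(i)
    # Semi-supervised step: pseudo-label the remaining samples by 1-NN.
    known = [(points[i], y) for i, y in labeled.items()]
    predicted = []
    for i, x in enumerate(points):
        predicted.append(labeled[i] if i in labeled else nearest_label(x, known))
    return predicted, sorted(labeled)


# Toy data (hypothetical): two well-separated groups on a line.
points = [(0.1 * k, 0.0) for k in range(10)] + [(5.0 + 0.1 * k, 0.0) for k in range(10)]
truth = [0] * 10 + [1] * 10
predicted, queried = active_semi_supervised(
    points, lambda i: truth[i], seed_idx=[0, 10], rounds=2, per_round=2
)
```

Here the oracle is consulted for only 6 of the 20 samples (2 seeds plus 2 queries in each of 2 rounds); the remaining labels come from propagation. A real system would replace both stand-ins with the selection strategy and classifier under study.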