Abstract

Active learning, also known as optimal experimental design, is a process for building a classifier or learning model with fewer training instances in a semi-supervised setting. It is a well-known approach used in many real-life machine learning and data mining applications. Active learning uses a query function and an oracle or expert (e.g., a human or other information source) to label unlabeled data instances and thereby boost the performance of a classifier. Labeling unlabeled data instances is difficult, time-consuming, and expensive. In this paper, we propose an approach based on cluster analysis for selecting informative training instances from a large number of unlabeled data instances (big data), so that a classifier suitable for active learning can be built from fewer training instances. The proposed method partitions the unlabeled big data into several clusters and selects informative instances from each cluster: the center of the cluster, the nearest neighbors of the center, and randomly chosen instances from the cluster. The objective is to find informative unlabeled instances and have the oracle label them, improving the classification results of machine learning algorithms applied to big data. We tested the proposed method on seven benchmark datasets from the UC Irvine Machine Learning Repository using five well-known machine learning algorithms: C4.5 (decision tree induction), SVM (support vector machines), Random Forest, Bagging, and Boosting (AdaBoost). The experimental analysis shows that the proposed method improves the performance of classifiers in active learning with fewer training instances.
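
The selection step described above (cluster center, nearest neighbors of the center, and random instances from each cluster) can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm or any parameter values, so everything specific here is an assumption: plain k-means is used for clustering, the instance closest to the computed centroid stands in for "the center", and the names `kmeans`, `select_queries`, `n_neighbors`, and `n_random` are hypothetical.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(points, centers):
    """Label each point with the index of its closest center."""
    return [min(range(len(centers)), key=lambda c: dist(p, centers[c]))
            for p in points]


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice of clustering algorithm)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = assign(points, centers)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign(points, centers)


def select_queries(points, k=3, n_neighbors=2, n_random=1, seed=0):
    """Return sorted indices of the instances to send to the oracle."""
    rng = random.Random(seed)
    centers, labels = kmeans(points, k, seed=seed)
    chosen = set()
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        # Rank the cluster's members by distance to its centroid.
        members.sort(key=lambda i: dist(points[i], centers[c]))
        # Closest member (proxy for the center) plus its nearest neighbors.
        head = members[:1 + n_neighbors]
        # A few random picks from the remaining members of the cluster.
        rest = members[1 + n_neighbors:]
        chosen.update(head)
        chosen.update(rng.sample(rest, min(n_random, len(rest))))
    return sorted(chosen)
```

Calling `select_queries(points, k=3)` on a list of coordinate tuples returns at most `k * (1 + n_neighbors + n_random)` indices. In an active learning loop, only those instances would be passed to the oracle for labeling before a classifier is trained.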

