Abstract

In practice, supervised learning algorithms require fully labeled datasets to achieve the high accuracy demanded by modern applications. In industrial settings, however, supervised learning algorithms can perform poorly because few labeled instances are available. Semi-supervised learning (SSL) is an automatic labeling approach that uses the available labels to infer the missing labels in partially labeled datasets. The large number of available SSL algorithms and the lack of systematic comparison between them leave practitioners without guidelines for selecting the appropriate one for their application. Moreover, each SSL algorithm is often validated and evaluated on a small number of common datasets, yet no research has examined which datasets are suitable for comparing different SSL algorithms. The purpose of this paper is to empirically evaluate the suitability of the datasets commonly used to evaluate and compare different SSL algorithms. We performed a simulation study using twelve datasets of three different data types (numerical, text, image) on thirteen different SSL algorithms. The contributions of this paper are two-fold. First, we propose the use of a Bayesian congeneric item response theory model to assess the suitability of commonly used datasets. Second, we compare the different SSL algorithms using these datasets. The results show that, with the exception of three datasets, the datasets have very low discrimination factors and are easily solved by current algorithms. Additionally, the SSL algorithms have overlapping 90% credible intervals, indicating uncertainty about the differences in accuracy between these SSL models. The paper concludes by suggesting that researchers and practitioners should consider more carefully the choice of datasets used for comparing SSL algorithms.
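To make the idea of inferring missing labels from available ones concrete, the following is a minimal sketch of a nearest-neighbour self-training loop. It is a toy illustration only and is not claimed to be one of the thirteen SSL algorithms evaluated in the paper; the function name `self_train` and the example points are hypothetical.

```python
import math

def self_train(points, labels):
    """Toy self-training (hypothetical helper, not from the paper).

    `labels[i]` is a class label, or None when the label is missing.
    On each pass, the unlabeled point closest to any labeled point
    copies that neighbour's label, until no missing labels remain.
    """
    labels = list(labels)  # work on a copy
    while None in labels:
        best = None  # (distance, unlabeled index, label to copy)
        for i, label_i in enumerate(labels):
            if label_i is not None:
                continue
            for j, label_j in enumerate(labels):
                if label_j is None:
                    continue
                d = math.dist(points[i], points[j])
                if best is None or d < best[0]:
                    best = (d, i, label_j)
        if best is None:
            break  # nothing is labeled, so nothing can be inferred
        labels[best[1]] = best[2]
    return labels

# Two well-separated groups with one labeled point each (made-up data).
points = [(0, 0), (0.5, 0), (1, 0), (5, 0), (5.5, 0), (6, 0)]
partial = ["a", None, None, "b", None, None]
completed = self_train(points, partial)
print(completed)
```

On data this cleanly separated, each missing label is filled in from its own group. This may also give some intuition for the finding that datasets with very low discrimination factors are easily solved by most algorithms.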
