Abstract

This paper shows how to use labeled and unlabeled data to improve inductive models with the help of transductive models. We propose a solution for the self-training scenario. Self-training is an effective semi-supervised wrapper method that can generalize any supervised inductive model to the semi-supervised setting: it iteratively refines an inductive model by bootstrapping from unlabeled data. Standard self-training uses the classifier (trained on the labeled examples) to label and select candidates from the unlabeled training set, which may be problematic because the initial classifier cannot be expected to provide highly confident predictions when labeled training data is scarce. As a result, it risks introducing too many wrongly labeled candidates into the labeled training set, which may severely degrade performance. To tackle this problem, we propose a novel self-training-style algorithm that incorporates a graph-based transductive model into the self-labeling process. Unlike standard self-training, our algorithm uses the labeled and unlabeled data as a whole to label and select unlabeled examples for training-set augmentation. We propose a robust transductive model based on a graph Markov random walk, which exploits the manifold assumption to output reliable predictions on unlabeled data from noisy labeled examples. The proposed algorithm greatly reduces the risk of performance degradation due to accumulated noise in the training set. Experiments show that the proposed algorithm can effectively exploit unlabeled data to improve classification performance.

Highlights

  • Traditional inductive models such as Naive Bayes, CARTs [1], and Support Vector Machines are usually trained in a supervised setting, which means they can only be trained on labeled data

  • Standard self-training uses the classifier (trained on the labeled examples) to label and select candidates from the unlabeled training set, which may be problematic because the initial classifier may not provide highly confident predictions when labeled training data is scarce

  • We show that incorporating transductive models into inductive models in the semi-supervised setting can improve classification performance


Summary

INTRODUCTION

Traditional inductive models such as Naive Bayes, CARTs [1], and Support Vector Machines are usually trained in a supervised setting, which means they can only be trained on labeled data. Standard self-training risks introducing too many wrongly labeled candidates into the labeled training set, which may severely degrade performance. Another drawback of self-training is that the newly added examples are not informative to the current classifier, since they can already be classified confidently [7]. In contrast, our transductive model naturally deals with noisy labeled data: it uses "label smoothing" to automatically adjust potentially wrong labels. By incorporating this transductive model into the self-training process, we expect any supervised inductive model it wraps to be greatly improved. We propose a novel self-training algorithm that employs a graph-based transductive model, using both labeled and unlabeled data to label and select unlabeled examples for training-set augmentation. We present the details of the proposed transductive graph-based model below.
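To make the overall procedure concrete, here is a minimal Python sketch of a self-training loop in which a transductive model, rather than the base classifier, performs the self-labeling. The `labeler` interface (a `fit_predict` returning labels and confidences), the confidence threshold `tau`, and the logistic-regression base learner are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train_with_transduction(X_l, y_l, X_u, labeler, n_iter=10, tau=0.9):
    """Self-training where a transductive model (not the base classifier)
    labels and selects unlabeled candidates. `labeler` is a hypothetical
    object with fit_predict(X_l, y_l, X_u) -> (labels, confidences)."""
    clf = LogisticRegression()  # stand-in for any supervised inductive model
    for _ in range(n_iter):
        clf.fit(X_l, y_l)
        if len(X_u) == 0:
            break
        # Transductive step: labeled and unlabeled data are used jointly.
        y_hat, conf = labeler.fit_predict(X_l, y_l, X_u)
        picked = conf >= tau  # keep only confidently labeled candidates
        if not picked.any():
            break
        X_l = np.vstack([X_l, X_u[picked]])
        y_l = np.concatenate([y_l, y_hat[picked]])
        X_u = X_u[~picked]
    return clf
```

Such a `labeler` could be instantiated by a graph-based random-walk model of the kind sketched in the next section.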

Markov Random Walk with Constraints
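For orientation, the following is a generic sketch of label propagation via a Markov random walk on a Gaussian-kernel similarity graph, the family of transductive models the paper builds on. The kernel bandwidth `sigma`, the number of walk steps `t`, and the hard clamping of labeled points are assumptions of this sketch; the paper's constrained formulation may differ.

```python
import numpy as np

def random_walk_propagation(X, y, labeled_mask, sigma=1.0, t=10):
    """Propagate labels via t steps of a Markov random walk on a
    Gaussian-kernel similarity graph (a generic sketch, not the
    paper's exact constrained formulation).
    y: int labels in {0..K-1} for labeled points, ignored elsewhere."""
    # Pairwise affinities W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    P = W / W.sum(axis=1, keepdims=True)    # row-stochastic transition matrix
    K = int(y[labeled_mask].max()) + 1
    F = np.zeros((len(X), K))
    F[labeled_mask, y[labeled_mask]] = 1.0  # seed labeled points
    for _ in range(t):
        F = P @ F                           # one random-walk step
        F[labeled_mask] = 0.0               # re-clamp labeled points each step
        F[labeled_mask, y[labeled_mask]] = 1.0
    conf = F.max(axis=1) / np.clip(F.sum(axis=1), 1e-12, None)
    return F.argmax(axis=1), conf
```

The manifold assumption enters through the graph: points connected by high-affinity paths receive similar label distributions, so predictions on unlabeled data vary smoothly along the data manifold.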
Problem Description and Notation
EXPERIMENTS AND DISCUSSION
Findings
CONCLUSIONS