Semi-supervised classification has recently become an active research topic, and a number of algorithms, such as self-training, have been proposed to improve supervised classification by exploiting unlabeled data. The performance of self-training is affected both by the spatial distribution of the data set and by mislabeled samples. To address these two issues, this paper proposes an improved self-training algorithm based on density peaks and the cut edge weight statistic. First, representative unlabeled samples are selected for label prediction according to the spatial structure of the data, which is discovered by a clustering method based on density peaks. Second, the cut edge weight is used as the statistic in a hypothesis test that identifies whether each sample has been labeled correctly. Third, the labeled data set is gradually enlarged with the correctly labeled samples. These steps are iterated until all unlabeled samples are labeled. The improved self-training framework not only makes full use of spatial structure information but also addresses the problem that some samples may be classified incorrectly. As a result, classification accuracy improves considerably. Extensive experiments on benchmark data sets demonstrate the effectiveness of the proposed algorithm.
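The three steps can be illustrated with a minimal Python sketch. It is not the paper's implementation and rests on several assumptions: the density-peaks quantities use a plain cutoff kernel with a hypothetical cutoff `d_c`; label prediction uses a 1-nearest-neighbor rule; and the hypothesis test on the cut edge weight statistic is replaced by a much simpler stand-in, namely a threshold (`max_cut_ratio`, a hypothetical parameter) on the weighted share of neighbors that disagree with the predicted label. The function names `density_peaks` and `self_train` are likewise illustrative.

```python
import math


def density_peaks(points, d_c):
    """Return (rho, delta, parent) for each point.

    rho[i]    : number of other points closer than the cutoff d_c
    delta[i]  : distance to the nearest point ranked higher by density
    parent[i] : index of that point (-1 for the top-ranked point)
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # Rank by decreasing density; ties are broken by index (an assumption),
    # so every point except the top-ranked one gets a parent.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    parent = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])
            continue
        j = min(order[:rank], key=lambda m: dist[i][m])
        delta[i] = dist[i][j]
        parent[i] = j
    return rho, delta, parent


def self_train(points, labels, d_c, k=3, max_cut_ratio=0.5):
    """Fill in the None entries of `labels` by structure-guided self-training."""
    labels = list(labels)
    n = len(points)
    if all(lab is None for lab in labels):
        raise ValueError("at least one labeled sample is required")
    _, _, parent = density_peaks(points, d_c)

    def d(i, j):
        return math.dist(points[i], points[j])

    while any(lab is None for lab in labels):
        labeled = [i for i in range(n) if labels[i] is not None]
        unlabeled = [i for i in range(n) if labels[i] is None]
        # Step 1: pick unlabeled samples linked to a labeled sample
        # in the density structure (either direction of the parent link).
        linked = [i for i in unlabeled
                  if (parent[i] != -1 and labels[parent[i]] is not None)
                  or any(parent[j] == i for j in labeled)]
        candidates = linked or unlabeled
        accepted, fallback = {}, {}
        for i in candidates:
            nearest = sorted(labeled, key=lambda j: d(i, j))[:k]
            predicted = labels[nearest[0]]  # 1-NN prediction
            fallback[i] = predicted
            # Step 2 (stand-in for the hypothesis test): weighted share of
            # neighbors whose label differs from the predicted one.
            weights = [1.0 / (1.0 + d(i, j)) for j in nearest]
            cut = sum(w for w, j in zip(weights, nearest)
                      if labels[j] != predicted)
            if cut / sum(weights) <= max_cut_ratio:
                accepted[i] = predicted
        # Step 3: enlarge the labeled set. If nothing passed the filter,
        # fall back to the raw predictions so the loop always terminates.
        for i, lab in (accepted or fallback).items():
            labels[i] = lab
    return labels


# Toy data: two well-separated groups with one labeled sample each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels = ["a", None, None, None, "b", None, None, None]
completed = self_train(points, labels, d_c=1.5)
print(completed)
```

The fallback branch is a design choice of this sketch: because the abstract states that iteration continues until every unlabeled sample is labeled, the loop must make progress even in a round where no candidate passes the filter.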