Abstract

It has become common to collect massive datasets in modern applications. Massive and highly noise-contaminated data pose serious challenges to conventional semi-supervised learning methods. To tackle the challenges of this large-quantity, low-quality setting, we propose a distribution-free Markov subsampling strategy based on the Laplacian support vector machine (LapSVM) to achieve robust and effective estimation. The core idea is to construct an informative subset that allows us to conservatively correct a rough initial estimate towards the true classifier. Specifically, the proposed subsampling strategy selects samples with small losses via a probabilistic procedure, constructing a subset that stands a good chance of excluding the noisy data and providing a safe improvement over the rough initial estimate. Theoretically, we show that the obtained classifier is statistically consistent and can achieve a fast learning rate under mild conditions. The promising performance is also supported by simulation studies and real data examples.
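As a rough illustration of selecting small-loss samples through a probabilistic procedure, the sketch below runs a Metropolis-style chain over sample indices. The abstract does not state the acceptance rule, so the rule used here, min(1, exp(loss_current - loss_candidate)), is an assumption, and the function name markov_subsample and its parameters are hypothetical. The losses are assumed to come from a rough initial estimate such as a pilot LapSVM fit, which is not shown.

```python
import numpy as np


def markov_subsample(losses, n_sub, seed=0):
    """Pick n_sub indices, favoring samples with small loss.

    Hypothetical sketch: the acceptance rule is an assumption,
    not the rule confirmed by the paper.
    """
    rng = np.random.default_rng(seed)
    losses = np.asarray(losses, dtype=float)
    n = losses.shape[0]

    # Start the chain at a uniformly drawn sample.
    current = int(rng.integers(n))
    selected = [current]

    while len(selected) < n_sub:
        # Propose a candidate uniformly at random.
        candidate = int(rng.integers(n))
        # Always accept a candidate with smaller or equal loss; accept a
        # larger-loss candidate with exponentially decaying probability.
        accept_prob = min(1.0, float(np.exp(losses[current] - losses[candidate])))
        if rng.random() < accept_prob:
            current = candidate
            selected.append(current)

    # Indices may repeat, since the chain can revisit a sample.
    return np.array(selected)
```

Calling markov_subsample(losses, n_sub) on per-sample losses returns indices that could then be used to refit the classifier on the subset; high-loss samples, which are more likely to be noisy, are rarely accepted under this rule.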
