Abstract

Most machine learning methods rest on the assumption that training and test data are sampled from the same distribution. In practice, however, this assumption is frequently violated. The situation in which the training and test data are generated from different distributions is known as covariate shift. In natural language processing, covariate shift is especially likely to occur because of the size of the sample space: natural language data are effectively unbounded, so the distribution of the training data cannot fully reflect that of the entire data. In this paper, we investigate whether the performance of natural language processing methods can be improved by reducing the error caused by covariate shift. To this end, we propose an importance-weighted k-NN for base noun detection, in which each training example is weighted according to the discrepancy between the training and test distributions. Theoretically, importance weighting can improve performance under covariate shift. In our experiments, the proposed method outperforms standard k-NN.
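As a rough illustration of the idea, and not a reproduction of the paper's implementation, the sketch below shows an importance-weighted k-NN classifier in Python. It assumes the common formulation in which the weight of a training point x is the density ratio p_test(x) / p_train(x), estimated here with simple kernel density estimates; the feature representation, weighting scheme, and base noun detection setup used by the authors are not shown.

```python
# Hypothetical sketch of importance-weighted k-NN under covariate shift.
# Assumptions (not from the paper): weights are the density ratio
# p_test(x) / p_train(x), estimated with Gaussian kernel density estimates.
import numpy as np
from scipy.stats import gaussian_kde


def importance_weights(X_train, X_test):
    """Estimate w(x) = p_test(x) / p_train(x) for each training point."""
    p_train = gaussian_kde(X_train.T)
    p_test = gaussian_kde(X_test.T)
    # Guard against division by (near-)zero training density.
    return p_test(X_train.T) / np.maximum(p_train(X_train.T), 1e-12)


def weighted_knn_predict(X_train, y_train, weights, x_query, k=5):
    """Classify x_query by a vote of its k nearest training neighbours,
    where each neighbour's vote is scaled by its importance weight."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = {}
    for i in nearest:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + weights[i]
    return max(votes, key=votes.get)


# Toy usage on synthetic 2-D data whose test inputs are shifted.
rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(200, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_test = rng.normal(0.5, 1.0, size=(50, 2))  # covariate-shifted inputs

w = importance_weights(X_train, X_test)
pred = weighted_knn_predict(X_train, y_train, w, X_test[0], k=5)
```

Under this formulation, training points that are more typical of the test distribution receive larger votes, which is the mechanism by which importance weighting reduces the error introduced by covariate shift.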
