Coping with Distribution Change in the Same Domain Using Similarity-Based Instance Weighting

Jeong-Woo Son, Hyun-Je Song, Se-Young Park, Seong-Bae Park

doi:10.1007/978-3-642-05224-8_27

Abstract

Lexical features are considered the most crucial features in natural language processing (NLP) and are therefore widely used in machine learning algorithms applied to NLP tasks. However, because the lexical space is so diverse, machine learning algorithms that rely on lexical features suffer when the distributions of the training and test data differ. To cope with this distribution change, this paper proposes support vector machines with example-wise weights. Training examples are weighted according to their similarity to all test data, so that the weighted training distribution approximates the test distribution. Experimental results on text chunking show that a distribution change between training and test data does occur, and that the proposed method, which accounts for this change during training, outperforms ordinary support vector machines.
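The weighting idea can be illustrated with a minimal sketch. The abstract does not give the similarity measure or the normalization, so everything below is an assumption made for illustration: cosine similarity over sparse lexical feature dictionaries, a mean over all test examples, and rescaling so the weights average to 1. The names `cosine` and `instance_weights` and the toy feature keys are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse feature dicts (assumed measure)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def instance_weights(train, test):
    """Weight each training example by its mean similarity to all test
    examples, rescaled so that the weights average to 1 (assumed scheme)."""
    raw = [sum(cosine(x, t) for t in test) / len(test) for x in train]
    total = sum(raw)
    if total == 0.0:
        # No overlap with the test data at all: fall back to uniform weights.
        return [1.0] * len(train)
    scale = len(train) / total
    return [r * scale for r in raw]


# Toy lexical features (hypothetical): word identity and part-of-speech tag.
train = [
    {"w=the": 1.0, "pos=DT": 1.0},
    {"w=stock": 1.0, "pos=NN": 1.0},
    {"w=gene": 1.0, "pos=NN": 1.0},
]
test = [
    {"w=stock": 1.0, "pos=NN": 1.0},
    {"w=market": 1.0, "pos=NN": 1.0},
]

weights = instance_weights(train, test)
for example, weight in zip(train, weights):
    print(sorted(example), round(weight, 3))
```

Training examples that resemble the test data receive larger weights, and examples with no lexical overlap receive a weight of zero under this particular scheme. Such weights could then be passed to any SVM implementation that accepts per-example weights, for instance through the `sample_weight` argument of scikit-learn's `SVC.fit`. This is one possible realization, not necessarily the formulation used in the paper.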

Full Text