A Corpus-Based Approach for Automatic Thai Unknown Word Recognition using Ensemble Learning Techniques

Jakkrit Techo, Cholwich Nattee, Thanaruk Theeramunkong

doi:10.1007/978-3-642-01307-2_50

Abstract

This paper presents a corpus-based approach to automatic unknown word recognition in Thai. The approach applies an ensemble learning technique to build a model that classifies unknown word candidates using features obtained from a corpus. We propose a technique called evaluation by ranking, which clusters the unknown word candidates into groups based on their occurring locations. The candidate with the highest accuracy is then identified as an unknown word. In this task, the number of positive instances is much smaller than the number of negative instances, forming an unbalanced data set. To improve the prediction accuracy, we apply a boosting technique with voting under group-based evaluation by ranking. We have conducted experiments on real-world data to evaluate the performance of the proposed approach. The experiments compared the accuracy of our technique with that of an ordinary naive Bayes technique. Our technique achieves an accuracy of 90.93±0.50% when only the first rank is selected and 97.90±0.26% when candidates up to the tenth rank are considered. This is an improvement of 6.79% to 8.45%.
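The group-based evaluation by ranking described above can be illustrated with a minimal sketch. It assumes that each candidate already carries a group identifier (standing in for its occurring location) and a classifier score (standing in for whatever ranking criterion the model produces); the function names, the tuple layout, and the toy data are hypothetical and are not taken from the paper. The classifier itself (boosting with voting) is not shown.

```python
from collections import defaultdict


def rank_within_groups(candidates):
    """Group candidates by group id and sort each group by descending score.

    `candidates` is an iterable of (group_id, text, score) tuples.
    This layout is an assumption made for illustration only.
    """
    groups = defaultdict(list)
    for group_id, text, score in candidates:
        groups[group_id].append((text, score))
    return {
        group_id: sorted(items, key=lambda item: item[1], reverse=True)
        for group_id, items in groups.items()
    }


def top_k_accuracy(ranked, gold, k):
    """Fraction of groups whose gold unknown word is among the top k candidates."""
    hits = 0
    for group_id, items in ranked.items():
        top_texts = [text for text, _ in items[:k]]
        if gold[group_id] in top_texts:
            hits += 1
    return hits / len(ranked)


# Toy data with placeholder strings (hypothetical, not from the paper's corpus).
candidates = [
    ("g1", "ab", 0.9),
    ("g1", "abc", 0.4),
    ("g2", "xy", 0.2),
    ("g2", "xyz", 0.7),
    ("g2", "x", 0.5),
]
gold = {"g1": "ab", "g2": "x"}

ranked = rank_within_groups(candidates)
acc_at_1 = top_k_accuracy(ranked, gold, k=1)
acc_at_2 = top_k_accuracy(ranked, gold, k=2)
```

In this sketch, accuracy at the first rank counts a group as correct only if its top-scored candidate is the true unknown word, while accuracy up to rank k also accepts any of the k best candidates, which mirrors the first-rank and tenth-rank figures reported above.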

Full Text