In this paper, we proposed hybrid-KNN methods on the Arabic script web page language identification. One of the crucial tasks in the text-based language identification that utilizes the same script is how to produce reliable features and how to deal with the huge number of languages in the world. Specifically, it has involved the issue of feature representation, feature selection, identification performance, retrieval performance, and noise tolerance performance. Therefore, there are a number of methods that have been evaluated in this work; k-nearest neighbor (KNN), support vector machine (SVM), backpropagation neural networks (BPNN), hybrid KNN-SVM, and KNN-BPNN, in order to justify the capability of the state-of-the-art methods. KNN is prominent in data clustering or data filtering, SVM and BPNN are well known in supervised classification, and we have proposed hybrid-KNN for noise removal on web page language identification. We have used the standard measurements which are accuracy, precision, recall and F1 measurements to evaluate the effectiveness of the proposed hybrid-KNN. From the experiment, we have observed that BPNN is able to produce precise identification if the data set given is clean. However, when increasing the level of noise in the training data, KNN-SVM performs better than KNN-BPNN against the misclassification data, even on the level of 50% noise. Therefore, it is proven that KNN-SVM produce promising identification performance, in which KNN is able to reduce the noise in the data set and SVM is reliable in the language identification.
Read full abstract