Abstract

Although the number of additional resources in Macedonian which can be used for solving information retrieval problem (or general Natural Language Processing problem) is very limited, models exist which are general enough and do not need additional knowledge about the language. This paper presents a document classification model, that doesn't rely on any language specific additional resources. The model is trained and tested on a set of news articles extracted from Macedonian websites, and each document is labeled with a class representing one of the twelve category sections from which the documents were extracted. The goal of this paper is to test different methods for feature selection and choice of vocabulary. Furthermore, we choose a model which gives the best accuracy for document classification task and we make sensitivity analysis on its architecture in order to further improve its performance. Although similar research already exists, this paper aims to combine different experiments and test them on Macedonian language documents. The models used in this paper are Random Forest (RF), Support Vector Machines (SVM) and Neural Network (NN). The performed experiments showed that the best accuracy is achieved when each document is represented as tf-idf vector, the vocabulary contains equal number of representative words from each class, and simple Neural Network with 3 hidden layers is used as a model. The main conclusion is that a language independent model for solving document classification problem can be successfully build for Macedonian language, achieving around 80% accuracy on the test set.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call