Word2vec semantic representation in multilabel classification for Indonesian news article

Dyah Rahmawati, Masayu Leylia Khodra

doi:10.1109/icaicta.2016.7803115

Abstract

Multilabel text classification is the task of assigning a text to one or more categories. As with other supervised learning tasks, the performance of multilabel classification is limited when little labeled data is available, which makes it difficult to capture semantic relationships. Previous research on multilabel classification of Indonesian news articles relied on lexical features, namely bag of words with TF-IDF term weighting, and no work has yet used semantic features. This paper presents an implementation of multilabel classification that uses semantic features based on Word2vec. Word2vec is an unsupervised method that uses unlabeled data to convert each word into a vector representation, so that the semantic relationship between words can be found by measuring the distance between their vectors. The experiments show that these semantic features improve on the previous results obtained with the traditional bag-of-words and TF-IDF method, raising the testing F-measure from 76.73% to 80.17%.
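As a rough illustration of how word vectors can expose semantic relationships through distance, the following minimal sketch compares toy vectors with cosine similarity and averages them into a document-level feature. The three-dimensional vectors, the example words, and the averaging step are all assumptions made for illustration; the abstract does not state how the paper builds document features or which distance measure it uses, and real Word2vec vectors are learned from unlabeled text.

```python
import math

# Hypothetical 3-dimensional word vectors (illustrative values only).
# Real Word2vec vectors are learned from unlabeled text and usually
# have a few hundred dimensions.
TOY_VECTORS = {
    "presiden": [0.9, 0.1, 0.0],
    "menteri": [0.8, 0.2, 0.1],
    "gol": [0.0, 0.9, 0.3],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # similarity is undefined for a zero vector; treat as 0
    return dot / (norm_u * norm_v)


def document_vector(tokens, vectors):
    """Average the vectors of known tokens (an assumed aggregation scheme).

    Returns None when no token has a vector.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Words used in similar contexts should score higher than unrelated ones.
related = cosine_similarity(TOY_VECTORS["presiden"], TOY_VECTORS["menteri"])
unrelated = cosine_similarity(TOY_VECTORS["presiden"], TOY_VECTORS["gol"])

# A dense feature vector for a short tokenized text; unknown tokens are skipped.
features = document_vector(["presiden", "menteri", "kemarin"], TOY_VECTORS)
```

In this sketch `related` is larger than `unrelated`, and `features` could serve as the input to any multilabel classifier in place of a sparse TF-IDF vector.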

Full Text