Abstract

With the rapid development of Web 2.0, more and more people share their lives and opinions on social media sites and forums such as Weibo, Twitter, and Tianya, producing masses of short texts. Managing these short texts effectively has made short text classification an important branch of text classification. However, because the texts are short, carry few signals, and have sparse features, it is difficult to achieve high-quality classification with conventional methods. This paper proposes a novel feature extension method based on the N-gram model to address feature sparseness. From the continuous word sequences in the training set, we extract n-grams to build a feature extension pattern library. Then, using the features that appear in a short text, we compute the appearance probability of related words that do not occur in the original text and add them as extended features. We evaluate the method on a data set collected from Sina Weibo: after extending the features of the original short texts, we train a Naive Bayes classifier and measure precision, recall, and F1-score. The results show that the extension method based on the N-gram model noticeably improves classification performance.
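To make the pipeline in the abstract concrete, the following is a minimal sketch of how bigram-based feature extension followed by Naive Bayes classification might be implemented. The function names, the probability threshold, and the toy data are illustrative assumptions, not the paper's actual code; the paper's full pattern library may use higher-order n-grams and different smoothing.

```python
from collections import Counter
import math

def build_bigram_model(corpus):
    """Count unigrams and bigrams over tokenized training texts.

    corpus: list of token lists (assumed already word-segmented,
    e.g. by a Chinese segmenter for Weibo data).
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in corpus:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def extend_features(tokens, unigrams, bigrams, threshold=0.2):
    """Extend a short text with words w' whose conditional
    probability P(w' | w) = count(w, w') / count(w) meets the
    (assumed) threshold for some word w already in the text."""
    extended = list(tokens)
    for w in tokens:
        for (a, b), c in bigrams.items():
            if a == w and b not in extended and c / unigrams[a] >= threshold:
                extended.append(b)
    return extended

def train_nb(docs, labels):
    """Multinomial Naive Bayes with Laplace smoothing,
    trained on (extended) token lists."""
    classes = set(labels)
    vocab = {w for d in docs for w in d}
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, c in zip(docs, labels):
        counts[c].update(d)
    return prior, counts, vocab

def predict_nb(tokens, prior, counts, vocab):
    """Return the class with the highest smoothed log-posterior."""
    best, best_lp = None, float("-inf")
    for c in prior:
        total = sum(counts[c].values())
        lp = math.log(prior[c])
        for w in tokens:
            lp += math.log((counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```

A short text such as `["cheap"]` would first be extended with words that frequently follow "cheap" in the training set (e.g. "phone"), giving the Naive Bayes classifier more evidence than the single original token.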
