UBIS: Unigram Bigram Importance Score for Feature Selection from Short Text

Muskan Garg

doi:10.1016/j.eswa.2022.116563

Abstract

The volume of data generated on the internet over the past few decades is huge and continues to grow exponentially, making it difficult to manually classify online and offline short text documents. Two major feature extraction techniques are used in the existing literature: the TFIDF vectorizer and the Count vectorizer. The major challenge with these techniques is the large number of textual features they extract. Existing textual feature reduction techniques select features according to their correlation with the resulting value or category. However, the importance of uni-grams and bi-grams may contribute more efficiently to determining the feature space vector. This work proposes a Graph of Words (GoW) based selective feature extraction technique, the Uni-gram Bi-gram Importance Score (UBIS), which is obtained from the node scores and edge scores in the Graph of Words. Experimental results show the effectiveness of UBIS over the TFIDF vectorizer and the Count vectorizer hybridized with feature selection techniques. To test and validate the experiments, logistic regression with gradient descent is used as the linear classification model over three different binary text classification datasets.
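As a rough illustration of the Graph of Words idea, the following minimal sketch builds a co-occurrence graph from a few short texts and ranks uni-grams by a node score and bi-grams by an edge score. The abstract does not define either score, so the choices here are assumptions made only for illustration: the edge score is taken to be the raw co-occurrence count of adjacent words, and the node score is taken to be the weighted degree of a word. The function names (`build_graph_of_words`, `node_scores`, `select_features`) and the toy documents are hypothetical and are not the UBIS formulation itself.

```python
from collections import Counter


def build_graph_of_words(docs, window=2):
    """Count directed co-occurrence edges between words within a sliding window.

    With window=2 only adjacent words are linked, so every edge is a bi-gram.
    """
    edge_counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()  # naive whitespace tokenizer (assumption)
        for i, left in enumerate(tokens):
            for right in tokens[i + 1 : i + window]:
                if left != right:
                    edge_counts[(left, right)] += 1
    return edge_counts


def node_scores(edge_counts):
    """Assumed node score: weighted degree, i.e. the sum of incident edge counts."""
    scores = Counter()
    for (left, right), weight in edge_counts.items():
        scores[left] += weight
        scores[right] += weight
    return scores


def select_features(docs, k_unigrams, k_bigrams, window=2):
    """Keep the top-k uni-grams by node score and the top-k bi-grams by edge score."""
    edges = build_graph_of_words(docs, window)
    nodes = node_scores(edges)
    top_unigrams = [word for word, _ in nodes.most_common(k_unigrams)]
    top_bigrams = [" ".join(edge) for edge, _ in edges.most_common(k_bigrams)]
    return top_unigrams, top_bigrams


# Toy short-text corpus (hypothetical data, for illustration only).
docs = ["great phone great battery", "great battery life", "bad battery"]
unigrams, bigrams = select_features(docs, k_unigrams=2, k_bigrams=1)
print(unigrams, bigrams)
```

The selected uni-grams and bi-grams could then serve as a reduced vocabulary for a vectorizer feeding a linear classifier such as logistic regression, in place of the full TFIDF or count feature space.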

Full Text