Integrating Low-rank Approximation and Word Embedding for Feature Transformation in the High-dimensional Text Classification

Le Nguyen Hoai Nam, Ho Bao Quoc

doi:10.1016/j.procs.2017.08.058

Abstract

Under the Bag-of-Words model, a document corpus is initially represented by a Terms-Documents matrix. However, this high-dimensional raw Terms-Documents matrix needs to be transformed into a lower-dimensional semantic Concepts-Documents matrix, both to reduce the dimension of the feature space and to create more meaningful features. This paper analyzes two feature transformation (FT) models on the Terms-Documents matrix: the FT model based on Low-Rank Approximation (LRA) and the FT model based on Word Embedding (WE). Each has its own strengths and weaknesses in text transformation. The LRA-based FT takes a purely mathematical perspective, aiming to cover the original dispersed term set of the corpus statistically as well as possible, while the WE-based FT uses available word embedding vectors to enrich the contextual content of the corpus representation. Therefore, two combinations of the LRA-based FT and the WE-based FT, named LRAintoWE-based FT and WEintoLRA-based FT, are proposed to obtain comprehensive FTs that appropriately capture both the statistical and the contextual information. Experimental results on three benchmark datasets show that the information of the WE-based FT and the LRA-based FT can be integrated, and that their integration as LRAintoWE-based FT and WEintoLRA-based FT can improve classification performance compared with using either one alone.
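
As a rough illustration of the two base transformations named above, the following minimal sketch maps a toy Terms-Documents matrix to a Concepts-Documents matrix in two ways. It rests on assumptions not stated in the abstract: truncated SVD stands in for the low-rank approximation, the WE-based transformation is taken to be a simple projection of term counts through an embedding matrix, and the function names, the matrix `X`, and the embedding values `E` are all hypothetical. It does not attempt to reproduce the LRAintoWE or WEintoLRA combinations.

```python
import numpy as np

def lra_transform(X, k):
    """Assumed LRA-based FT: project a terms x documents matrix onto
    its top-k left singular vectors (truncated SVD)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k].T @ X  # shape: k x documents

def we_transform(X, E):
    """Assumed WE-based FT: combine term counts with word embeddings.
    E holds one d-dimensional embedding per term (terms x d)."""
    return E.T @ X  # shape: d x documents

# Toy Terms-Documents matrix: 4 terms (rows) x 3 documents (columns).
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# Made-up 2-dimensional embeddings, one row per term.
E = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.9],
              [0.2, 0.8]])

concepts_lra = lra_transform(X, k=2)  # statistical concepts
concepts_we = we_transform(X, E)      # embedding-based concepts
```

In both cases the four term features are reduced to two concept features per document; the first is derived only from the corpus statistics, the second from externally supplied embedding vectors.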

Full Text