Abstract

In this paper, we propose a novel framework for text classification based on subspace methods. Recent studies have shown the advantages of modeling texts as linear subspaces in a high-dimensional word vector space, which we refer to as word subspaces. Building on this, we propose solving topic classification and sentiment analysis using word subspaces together with different subspace-based methods. We explore the geometry of word embeddings to decide which subspace-based method is best suited to each task. We empirically demonstrate that a word subspace generated from a set of texts is a unique representation of a semantic topic and can be spanned by basis vectors derived from different texts; texts can therefore be classified by comparing their word subspaces with the topic class subspaces. We realize this framework with the mutual subspace method, which effectively handles multiple subspaces for classification. For sentiment analysis, since word embeddings do not necessarily encode sentiment information (i.e., words of opposite sentiment can have similar word vectors), we introduce the orthogonal mutual subspace method to push opposite-sentiment words apart. Furthermore, as the sentiment class subspaces may overlap due to shared topics, we propose modeling each sentiment class as a set of word subspaces, one generated from each text belonging to the class. We further model the sentiment classes on a Grassmann manifold using the Grassmann subspace method and its discriminative extension, the Grassmann orthogonal subspace method. We demonstrate the validity of each framework through experiments on four widely used datasets.
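To make the core idea concrete, the following is a minimal sketch (not the authors' code) of word-subspace classification via the mutual subspace method: a text's word subspace is taken as the span of the leading left singular vectors of its word-vector matrix, and two subspaces are compared through the canonical angles between them. All names, dimensions, and the random stand-in data are hypothetical; in practice the word vectors would come from pretrained embeddings such as word2vec or GloVe.

```python
# Minimal sketch of word-subspace classification with the mutual
# subspace method (MSM). Assumes pretrained d-dimensional word vectors;
# random data below is a stand-in for illustration only.
import numpy as np

def word_subspace(word_vectors: np.ndarray, dim: int) -> np.ndarray:
    """Return a (d x dim) orthonormal basis of the subspace spanned by a
    text's word vectors, from the leading left singular vectors of the
    (d x n) word-vector matrix."""
    U, _, _ = np.linalg.svd(word_vectors, full_matrices=False)
    return U[:, :dim]

def msm_similarity(U1: np.ndarray, U2: np.ndarray) -> float:
    """Similarity of two subspaces as the mean squared cosine of their
    canonical angles; the cosines are the singular values of U1^T U2."""
    cosines = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return float(np.mean(cosines ** 2))

# Usage: assign a text to the topic class with the most similar subspace.
rng = np.random.default_rng(0)
d = 300  # embedding dimension (assumed)
class_bases = [word_subspace(rng.standard_normal((d, 200)), 10)
               for _ in range(3)]  # one class subspace per topic
query = word_subspace(rng.standard_normal((d, 40)), 5)
pred = int(np.argmax([msm_similarity(query, C) for C in class_bases]))
print("predicted class:", pred)
```

The orthogonal and Grassmann variants described above modify this basic scheme: the former applies a whitening-like transform that orthogonalizes the class subspaces before comparison, while the latter treats each per-text subspace as a point on a Grassmann manifold and classifies among sets of such points.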
