A Comparative Study of Methods for Visualizable Semantic Embedding of Small Text Corpora

Rishabh Choudhary,Ali A Minai,Simona Doboli

doi:10.1109/ijcnn52387.2021.9534250

Abstract

Text embedding has recently emerged as a very useful and successful method for semantic representation. Following initial word-level embedding methods such as Latent Semantic Analysis (LSA) and topic-based bag-of-words approaches like Latent Dirichlet Allocation (LDA), the focus has turned to language models and text encoders implemented as neural networks - ranging from word-level models to those embedding whole documents. The distinctive feature of these models is their ability to infer semantic spaces at all levels based purely on data, with no need for complexities such as syntactic analysis or ontology building. Many of these models are available pre-trained on enormous amounts of data, providing downstream applications with general-purpose semantic spaces. In particular, embedding models at the sentence level or higher are most useful in applications because the meaning of text only becomes clear at that level. Most text embedding methods produce text embeddings in high-dimensional spaces, with a dimensionality ranging from a few hundred to thousands. However, it is often useful to visualize semantic spaces in very low dimension, which requires the use of dimensionality reduction methods. It is not clear what language models and what method of dimensionality reduction would work well in these cases. In this paper, we compare four text embedding methods in combination with three methods of dimensionality reduction to map three related real-world datasets comprising textual descriptions of items in a particular domain (sports) to a 2-dimensional semantic visualization space. The results provide several insights into the utility of these methods for data of this type.

Full Text