Abstract

Text embedding has recently emerged as a useful and successful method for semantic representation. Following initial word-level embedding methods such as Latent Semantic Analysis (LSA) and topic-based bag-of-words approaches like Latent Dirichlet Allocation (LDA), the focus has turned to language models and text encoders implemented as neural networks, ranging from word-level models to those embedding whole documents. The distinctive feature of these models is their ability to infer semantic spaces at all levels purely from data, with no need for complexities such as syntactic analysis or ontology building. Many of these models are available pre-trained on enormous amounts of data, providing downstream applications with general-purpose semantic spaces. In particular, embedding models at the sentence level or higher are the most useful in applications because the meaning of text only becomes clear at that level. Most text embedding methods produce embeddings in high-dimensional spaces, with dimensionality ranging from a few hundred to thousands. However, it is often useful to visualize semantic spaces in very low dimensions, which requires dimensionality reduction. It is not clear which language models and which dimensionality reduction methods work well for this purpose. In this paper, we compare four text embedding methods in combination with three methods of dimensionality reduction to map three related real-world datasets, comprising textual descriptions of items in a particular domain (sports), to a 2-dimensional semantic visualization space. The results provide several insights into the utility of these methods for data of this type.
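
The embed-then-reduce pipeline described above can be illustrated with a minimal sketch. The abstract does not name the four embedding methods or the three dimensionality reduction methods, so everything here is an assumption made for illustration: plain bag-of-words counts stand in for a text embedding, PCA (computed with NumPy's SVD) stands in for the dimensionality reduction step, and the four item descriptions and the helper names `bag_of_words` and `pca_2d` are invented and are not taken from the paper's datasets or code.

```python
import numpy as np

# Toy sports-themed item descriptions (invented for illustration;
# these are NOT the paper's datasets).
docs = [
    "lightweight running shoes for road racing",
    "trail running shoes with extra grip",
    "carbon fiber tennis racket for power players",
    "tennis racket with a large head for beginners",
]


def bag_of_words(texts):
    """Return (count matrix, vocabulary) for whitespace-tokenized texts.

    Assumption: raw term counts stand in for a real text embedding model.
    """
    vocab = sorted({tok for t in texts for tok in t.lower().split()})
    index = {tok: j for j, tok in enumerate(vocab)}
    X = np.zeros((len(texts), len(vocab)))
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            X[i, index[tok]] += 1.0
    return X, vocab


def pca_2d(X):
    """Project the rows of X onto their first two principal components.

    Assumption: PCA stands in for whichever reduction methods the paper uses.
    """
    Xc = X - X.mean(axis=0)                      # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                         # (n_docs, 2) coordinates


X, vocab = bag_of_words(docs)   # high-dimensional representation
coords = pca_2d(X)              # 2-dimensional visualization space

for text, (x, y) in zip(docs, coords):
    print(f"{x:+.3f} {y:+.3f}  {text}")
```

In practice the count matrix would be replaced by the output of a pre-trained sentence or document encoder, and the 2-dimensional coordinates would be passed to a scatter plot; the sketch only shows how the two stages fit together.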
