Abstract

Topic modeling and word embedding are commonly used techniques in natural language processing. However, word embeddings cannot discriminate between homonymous and polysemous senses, as they typically assign a single vector to each word even when that word has different meanings in different contexts. Many models have been proposed to address this issue by jointly learning a topic model and word embeddings, so that a word can have different embeddings under different latent topics. However, the number of latent topics is set manually, based on human experience, before the topic model is trained, so the final performance of the topical model depends heavily on subjective human judgment. If the number of topics is set too high, the learned topics may overfit; if it is set too low, the topics may carry little useful information. Experiments by other researchers show that the scale at which latent topics are modeled is crucial to the performance of topical word embeddings. Based on this observation, we propose Multi-scaled Topic Embedding (MTE), which learns document representations from topic models with multiple numbers of latent topics. With multi-scaled topics, MTE learns topical information at a coarse scale while capturing key information from dense topic distributions. Most importantly, MTE reduces the influence of the chosen number of topics on the model's performance. In this paper, we apply MTE to three commonly used datasets and evaluate it on text classification tasks. The experimental results show that our model outperforms deep learning baselines and typical topical word embedding models.
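
The abstract does not specify implementation details, so the following is only a minimal sketch of the multi-scale idea it describes: learn topic distributions at several different topic counts and concatenate them into one document representation for classification. The function name, the choice of LDA as the topic model, and the particular topic scales are illustrative assumptions, not the authors' MTE implementation.

```python
# Minimal sketch (not the authors' MTE implementation): concatenate
# document-topic distributions learned at several topic counts into a
# multi-scale document representation, then train a simple classifier.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

def multi_scale_topic_features(texts, topic_scales=(10, 30, 50), seed=0):
    """Return concatenated document-topic distributions, one block per topic count."""
    counts = CountVectorizer(max_features=5000, stop_words="english").fit_transform(texts)
    blocks = []
    for k in topic_scales:
        lda = LatentDirichletAllocation(n_components=k, random_state=seed)
        blocks.append(lda.fit_transform(counts))  # shape: (n_docs, k)
    return np.hstack(blocks)  # shape: (n_docs, sum(topic_scales))

# Usage: classify documents from their multi-scale topic features.
docs = ["the court ruled on the patent case", "the team won the final match"]
labels = [0, 1]
X = multi_scale_topic_features(docs)
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))
```

Because each block of features comes from a different topic scale, the downstream classifier is less sensitive to any single choice of topic count, which is the effect the abstract attributes to MTE.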
