Abstract
With the rapid proliferation of social networking sites (SNS), automatic topic extraction from the various text messages posted on SNS is becoming an important source of information for understanding current social trends or needs. Latent Dirichlet Allocation (LDA), a probabilistic generative model, is one of the most popular topic models in Natural Language Processing (NLP) and has been widely used in information retrieval, topic extraction, and document analysis. Unlike long texts from formal documents, messages on SNS are generally short. Traditional topic models such as LDA or pLSA (probabilistic latent semantic analysis) suffer performance degradation in short-text analysis because each short text carries too little word co-occurrence information. To cope with this problem, various techniques for interpretable topic modeling of short texts are evolving; combining topic models with word embeddings pretrained on an external corpus is one of them. Owing to recent developments in deep neural networks (DNN) and deep generative models, neural-topic models (NTM) are emerging as a way to achieve flexibility and high performance in topic modeling. However, very few research works address neural-topic models with pretrained word embedding for generating high-quality topics from short texts. In this work, in addition to pretrained word embedding, a fine-tuning stage with the original corpus is proposed for training neural-topic models in order to generate semantically coherent, corpus-specific topics. An extensive study of eight neural-topic models was carried out to assess the effectiveness of additional fine-tuning and pretrained word embedding in generating interpretable topics, using simulation experiments on several benchmark datasets. The extracted topics are evaluated with different metrics of topic coherence and topic diversity. We have also studied the performance of the models in classification and clustering tasks.
Our study concludes that although auxiliary word embedding with a large external corpus improves the topic coherence of short texts, an additional fine-tuning stage is needed to generate more corpus-specific topics from short-text data.
Highlights
Due to the rapid development of computing and communication technologies and the widespread use of the internet, people are gradually becoming accustomed to communicating through various online social platforms, such as microblogs, Twitter, webpages, Facebook, etc.
We found that although pretrained word embedding enhances the topic coherence of short texts, as it does for long and formal texts, the generated topics often comprised words with common meanings rather than the short-text-specific semantics of those words, which are especially important for real-world datasets
The simulation experiments have been performed with several benchmark datasets, and the performance of the topic models is evaluated by topic coherence and topic diversity measures
Summary
Due to the rapid development of computing and communication technologies and the widespread use of the internet, people are gradually becoming accustomed to communicating through various online social platforms, such as microblogs, Twitter, webpages, Facebook, etc. In traditional natural language processing, topic modeling is considered an effective technique for the semantic understanding of text documents. Conventional topic models, such as pLSA [1] or LDA [2] and their many variants, are good at extracting latent semantic structures from a text corpus without prior annotations and are widely used in emerging topic detection, document classification, comment summarization, and event tracking. In these models, documents are viewed as a mixture of topics, while each topic is viewed as a particular distribution over all the words. The efficient capture of document-level word co-occurrence patterns leads to the success of topic modeling.
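The view of a document as a mixture of topics, with each topic a distribution over the vocabulary, can be sketched in a few lines. This is only an illustrative toy, not the setup used in the study: the vocabulary, the two topics, the mixture weights, and the function name `word_distribution` are all assumptions made for this example.

```python
# Illustrative toy numbers only; none of these values come from the paper.
vocab = ["game", "team", "vote", "party"]

# Each topic is a probability distribution over the whole vocabulary.
topic_word = [
    [0.5, 0.4, 0.05, 0.05],  # hypothetical "sports" topic
    [0.05, 0.05, 0.5, 0.4],  # hypothetical "politics" topic
]

# Each document is a mixture (probability distribution) over topics.
doc_topic = [0.75, 0.25]


def word_distribution(doc_topic, topic_word):
    """Return p(w | d) = sum_k p(w | k) * p(k | d) for every word w."""
    n_words = len(topic_word[0])
    return [
        sum(theta_k * topic[w] for theta_k, topic in zip(doc_topic, topic_word))
        for w in range(n_words)
    ]


p_w_given_d = word_distribution(doc_topic, topic_word)
for word, prob in zip(vocab, p_w_given_d):
    print(f"{word}: {prob:.4f}")
```

Under this view, a short message contributes only a handful of words, so the co-occurrence evidence available for estimating `doc_topic` and `topic_word` is sparse, which is the difficulty the abstract describes.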