Abstract
Dynamic Topic Modeling (DTM) extracts topics from the short texts generated in Online Social Networks (OSNs) such as Twitter. A DTM solution must be scalable and must account for the sparsity and dynamicity of short texts. Current solutions combine probabilistic mixture models, such as the Dirichlet Multinomial or the Pitman-Yor Process, with approximate inference approaches, such as Gibbs Sampling and Stochastic Variational Inference, to account for dynamicity and scalability, respectively. However, these methods rely on weak probabilistic language models that do not account for the sparsity of short texts, and their inference is based on iterative optimizations that scale poorly in the DTM setting. We present GDTM, a single-pass graph-based DTM algorithm, to solve this problem. GDTM combines a context-rich, incremental feature representation with graph partitioning to address scalability and dynamicity, and uses a rich language model to account for sparsity. We run multiple experiments over a large-scale Twitter dataset to analyze the accuracy and scalability of GDTM and compare the results with four state-of-the-art models. GDTM outperforms the best of these models by 11% on accuracy, runs an order of magnitude faster, and produces four times better topic quality on standard evaluation metrics.
Highlights
Motivation: topic modeling [1] is the problem of automatically classifying words, which form the context of documents, into similarity groups known as topics
Given a dataset D with n documents, tagged with k hand labels, L = {l1, . . . , lk}, and a classification of the documents into k class labels, C = {c1, . . . , ck}, the B-Cubed precision and recall of a document d with hand label ld and class label cd are calculated as: Precision(d) = |{d′ ∈ D : cd′ = cd ∧ ld′ = ld}| / |{d′ ∈ D : cd′ = cd}| and Recall(d) = |{d′ ∈ D : cd′ = cd ∧ ld′ = ld}| / |{d′ ∈ D : ld′ = ld}|
We demonstrate the accuracy and scalability of GDTM by running the algorithm over two sets of experiments
We developed GDTM, a solution for dynamic topic modeling on short texts in online social networks
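The B-Cubed metric mentioned above scores a clustering by computing, for each document, the fraction of documents in its cluster that share its hand label (precision) and the fraction of documents with its hand label that share its cluster (recall), then averaging over all documents. A minimal sketch in Python (the function name and input representation are our own; the paper's exact evaluation setup may differ):

```python
from collections import Counter

def b_cubed(hand_labels, class_labels):
    """Average B-Cubed precision, recall and F1 over all documents.

    hand_labels[i] is the gold (hand) label of document i;
    class_labels[i] is the cluster assigned to document i.
    """
    n = len(hand_labels)
    # How many documents share both a given hand label and a given cluster.
    joint = Counter(zip(hand_labels, class_labels))
    label_sizes = Counter(hand_labels)
    cluster_sizes = Counter(class_labels)

    precision = recall = 0.0
    for l, c in zip(hand_labels, class_labels):
        correct = joint[(l, c)]            # same label AND same cluster
        precision += correct / cluster_sizes[c]
        recall += correct / label_sizes[l]
    precision /= n
    recall /= n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A perfect clustering yields precision = recall = 1; merging two gold classes into one cluster lowers precision while leaving recall at 1 for the merged documents, which is why both sides are reported.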
Summary
Motivation: topic modeling [1] is the problem of automatically classifying words, which form the context of documents, into similarity groups known as topics. Documents generated in today's social media (such as Twitter or Facebook) are (i) fast (large-scale and continuous), (ii) sparse (short) and (iii) dynamic (with the constant emergence of newly generated phrases and context structures). Extracting topics under these conditions is the problem known as Dynamic Topic Modeling (DTM). A legitimate solution to DTM should continuously receive a large number of short texts, extract their topics and adapt to changes in those topics