Short Text Topic Modeling Research Articles

With the exponential growth in the daily publication of scientific articles, automatic classification and categorization can assist in assigning articles to a predefined category. Article titles are concise descriptions of the articles' content with valuable information that can be useful in document classification and categorization. However, shortness, data sparseness, limited word occurrences, and the inadequate contextual information of scientific document titles hinder the direct application of conventional text mining and machine learning algorithms on these short texts, making their classification a challenging task. This study firstly explores the performance of our earlier study, TextNetTopics on the short text. Secondly, here we propose an advanced version called TextNetTopics Pro, which is a novel short-text classification framework that utilizes a promising combination of lexical features organized in topics of words and topic distribution extracted by a topic model to alleviate the data-sparseness problem when classifying short texts. We evaluate our proposed approach using nine state-of-the-art short-text topic models on two publicly available datasets of scientific article titles as short-text documents. The first dataset is related to the Biomedical field, and the other one is related to Computer Science publications. Additionally, we comparatively evaluate the predictive performance of the models generated with and without using the abstracts. Finally, we demonstrate the robustness and effectiveness of the proposed approach in handling the imbalanced data, particularly in the classification of Drug-Induced Liver Injury articles as part of the CAMDA challenge. Taking advantage of the semantic information detected by topic models proved to be a reliable way to improve the overall performance of ML classifiers.

Read full abstract

With the emerging of massive short texts, e.g., social media posts and question titles from Q&A systems, discovering valuable information from them is increasingly significant for many real-world applications of content analysis. The family of topic modeling can effectively explore the hidden structures of documents through the assumptions of latent topics. However, due to the sparseness of short texts, the existing topic models, e.g., latent Dirichlet allocation, lose effectiveness on them. To this end, an effective solution, namely Dirichlet multinomial mixture (DMM), supposing that each short text is only associated with a single topic, indirectly enriches document-level word co-occurrences. However, DMM is sensitive to noisy words, where it often learns inaccurate topic representations at the document level. To address this problem, we extend DMM to a novel Laplacian Dirichlet Multinomial Mixture (LapDMM) topic model for short texts. The basic idea of LapDMM is to preserve local neighborhood structures of short texts, enabling to spread topical signals among neighboring documents, so as to modify the inaccurate topic representations. This is achieved by incorporating the variational manifold regularization into the variational objective of DMM, constraining the close short texts with similar variational topic representations. To find nearest neighbors of short texts, before model inference, we construct an offline document graph, where the distances of short texts can be computed by the word mover’s distance. We further develop an online version of LapDMM, namely Online LapDMM, to achieve inference speedup on massive short texts. Carrying this implications, we exploit the spirit of stochastic optimization with mini-batches and an up-to-date document graph that can efficiently find approximate nearest neighbors instead. To evaluate our models, we compare against the state-of-the-art short text topic models on several traditional tasks, i.e., topic quality, document clustering and classification. The empirical results demonstrate that our models achieve very significant performance gains over the baseline models.

Read full abstract

Short Text Topic Modeling Research Articles

Related Topics

Articles published on Short Text Topic Modeling

Short-text topic modeling with dual reinforcement from internal and external semantics

Applying short text topic models to instant messaging communication of software developers

A semi-supervised approach of short text topic modeling using embedded fuzzy clustering for Twitter hashtag recommendation

TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information.

Topic modeling methods for short texts: A survey

A systematic review of the use of topic models for short text social media analysis

Short text topic modelling using local and global word-context semantic correlation.

Short text topic modelling approaches in the context of big data: taxonomy, survey, and analysis.

Using Delay Logs and Machine Learning to Support Passenger Railway Operations

Topic extraction from extremely short texts with variational manifold regularization

Topic prediction and knowledge discovery based on integrated topic modeling and deep neural networks approaches

A Pseudo-document-based Topical N-grams model for short texts

Short Text Topic Modeling Techniques, Applications, and Performance: A Survey

Dirichlet Multinomial Mixture with Variational Manifold Regularization: Topic Modeling over Short Texts

Incorporating word embeddings into topic modeling of short text

Experimental explorations on short text topic mining between LDA and NMF based Schemes

Topic modeling for analyzing open-ended survey responses

Filtering out the noise in short text topic modeling

GLTM: A Global and Local Word Embedding-Based Topic Model for Short Texts

Short text topic modeling by exploring original documents

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Short Text Topic Modeling Research Articles

Related Topics

Articles published on Short Text Topic Modeling

Short-text topic modeling with dual reinforcement from internal and external semantics

Applying short text topic models to instant messaging communication of software developers

A semi-supervised approach of short text topic modeling using embedded fuzzy clustering for Twitter hashtag recommendation

TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information.

Topic modeling methods for short texts: A survey

A systematic review of the use of topic models for short text social media analysis

Short text topic modelling using local and global word-context semantic correlation.

Short text topic modelling approaches in the context of big data: taxonomy, survey, and analysis.

Using Delay Logs and Machine Learning to Support Passenger Railway Operations

Topic extraction from extremely short texts with variational manifold regularization

Topic prediction and knowledge discovery based on integrated topic modeling and deep neural networks approaches

A Pseudo-document-based Topical N-grams model for short texts

Short Text Topic Modeling Techniques, Applications, and Performance: A Survey

Dirichlet Multinomial Mixture with Variational Manifold Regularization: Topic Modeling over Short Texts

Incorporating word embeddings into topic modeling of short text

Experimental explorations on short text topic mining between LDA and NMF based Schemes

Topic modeling for analyzing open-ended survey responses

Filtering out the noise in short text topic modeling

GLTM: A Global and Local Word Embedding-Based Topic Model for Short Texts

Short text topic modeling by exploring original documents