BETM: A new pre-trained BERT-guided embedding-based topic model
- Research Article
1
- 10.1016/j.procs.2024.05.069
- Jan 1, 2024
- Procedia Computer Science
Sustainable Topic Modeling for Legal Moroccan Arabic Language: A Challenging Study on BERTopic Technique
- Research Article
33
- 10.1109/access.2018.2863260
- Jan 1, 2018
- IEEE Access
Short texts have become a prevalent source of information, and discovering topical information from short text collections is valuable for many applications. Due to the length limitation, conventional topic models based on document-level word co-occurrence information often fail to distill semantically coherent topics from short text collections. On the other hand, word embeddings have been successfully applied as a powerful tool in natural language processing. Word embeddings trained on a large corpus encode general semantic and syntactic information about words, and hence can be leveraged to guide topic modeling for short text collections as supplementary information for sparse co-occurrence patterns. However, word embeddings are trained on a large external corpus, and the encoded information is not necessarily suitable for the topic model's training set, a mismatch most existing models ignore. In this article, we propose a novel global and local word embedding-based topic model (GLTM) for short texts. In the GLTM, we train global word embeddings on a large external corpus and employ the continuous skip-gram model with negative sampling (SGNS) to obtain local word embeddings. Utilizing both the global and local word embeddings, the GLTM can distill semantic relatedness information between words, which can be further leveraged by the Gibbs sampler in the inference process to strengthen the semantic coherence of topics. Compared with five state-of-the-art short text topic models on four real-world short text collections, the proposed GLTM is superior in most cases.
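The GLTM abstract above describes blending similarity from global (external-corpus) and local (collection-specific) word embeddings into a single relatedness signal. A minimal sketch of that idea, assuming toy 3-d vectors and an invented `relatedness` helper (not the paper's actual formulation):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

def relatedness(word_a, word_b, global_vecs, local_vecs, weight=0.5):
    """Blend global and local embedding similarity into one relatedness score."""
    g = cosine(global_vecs[word_a], global_vecs[word_b])
    l = cosine(local_vecs[word_a], local_vecs[word_b])
    return weight * g + (1 - weight) * l

# Toy 3-d embeddings standing in for vectors trained on an external corpus
# (global) and on the short-text collection itself (local).
global_vecs = {"phone": [1.0, 0.2, 0.0], "mobile": [0.9, 0.3, 0.1], "sky": [0.0, 0.1, 1.0]}
local_vecs = {"phone": [0.8, 0.4, 0.0], "mobile": [0.7, 0.5, 0.1], "sky": [0.1, 0.0, 0.9]}

print(relatedness("phone", "mobile", global_vecs, local_vecs))  # high
print(relatedness("phone", "sky", global_vecs, local_vecs))     # low
```

A Gibbs sampler can then bias co-assignment of word pairs whose blended score is high, which is the mechanism the abstract credits for improved topic coherence.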
- Research Article
6
- 10.1016/j.patrec.2023.06.007
- Jun 8, 2023
- Pattern Recognition Letters
WETM: A word embedding-based topic model with modified collapsed Gibbs sampling for short text
- Research Article
3
- 10.1145/3607189
- Nov 24, 2023
- ACM Transactions on Software Engineering and Methodology
Technical Q&A sites, such as Stack Overflow and Ask Ubuntu, have been widely utilized by software engineers to seek support for development challenges. However, not all the raised questions get instant feedback, and the retrieved answers can vary in quality. Users often spend considerable time before their problems are solved. Prior studies propose approaches to automatically recommend answers for question posts on technical Q&A sites. However, lengthiness and the lack of background knowledge limit the performance of answer recommendation on these sites. Irrelevant sentences in the posts may introduce noise into semantics learning and prevent neural models from capturing the gist of texts. The lexical gap between question and answer posts further misleads current models into making faulty recommendations. To this end, we propose a novel neural network named TopicAns for answer selection on technical Q&A sites. TopicAns aims at learning high-quality representations for the posts in Q&A sites with a neural topic model and a pre-trained model. This involves three main steps: (1) generating topic-aware representations of Q&A posts with the neural topic model, (2) incorporating corpus-level knowledge from the neural topic model to enhance the deep representations generated by the pre-trained language model, and (3) determining the most suitable answer for a given query based on the topic-aware representation and the deep representation. Moreover, we propose a two-stage training technique to improve the stability of our model. We conduct comprehensive experiments on four benchmark datasets to verify the effectiveness of the proposed TopicAns. Experiment results suggest that TopicAns consistently outperforms state-of-the-art techniques by over 30% in terms of Precision@1.
- Conference Article
- 10.1145/3582935.3582939
- Nov 4, 2022
Extracting topics from documents is a common task in Natural Language Processing (NLP). Both traditional feature extraction methods and various topic models can be used to find such key information. Latent Dirichlet Allocation (LDA) is one of the classic topic models. Recently popular deep learning pre-trained models have greatly improved performance on various NLP tasks, and applying pre-trained models to downstream tasks has research value. The application of Chinese pre-trained models also requires further exploration. This paper argues that combining deep learning techniques can improve traditional methods of finding key information. Therefore, building on the deep learning knowledge tagging model WordTag, we combine its knowledge tagging results with the LDA topic model and propose a topic extraction method based on word classification tagging (WordTag and Latent Dirichlet Allocation, WT-LDA). Experiments show that the method proposed in this paper is more effective than other topic extraction methods.
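The WT-LDA abstract above combines word-class tagging with LDA. One plausible reading is that tag information filters tokens before topic modeling; a toy sketch under that assumption, with an invented tag lexicon standing in for WordTag output:

```python
# Hypothetical tag lexicon standing in for WordTag output; the real system
# produces much richer knowledge labels than these coarse classes.
TAGS = {"model": "noun", "topic": "noun", "learn": "verb", "the": "stopword",
        "deep": "adj", "of": "stopword"}
KEEP = {"noun", "verb", "adj"}

def filter_by_tag(tokens):
    """Keep only tokens whose tag suggests topical content."""
    return [t for t in tokens if TAGS.get(t) in KEEP]

doc = ["the", "deep", "model", "of", "topic", "learn"]
print(filter_by_tag(doc))  # ['deep', 'model', 'topic', 'learn']
```

The filtered token stream would then be passed to a standard LDA implementation; the filtering step is what injects the tagging model's knowledge into the otherwise bag-of-words pipeline.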
- Research Article
2
- 10.1080/13658816.2023.2213869
- Jun 26, 2023
- International Journal of Geographical Information Science
The use of social media and location-based networks through GPS-enabled devices provides geospatial data for a plethora of applications in urban studies. However, the extent to which information found in geo-tagged social media activity corresponds to the spatial context is still a topic of debate. In this article, we developed a framework aimed at retrieving the thematic and spatial relationships between content originated from space-based (Twitter) and place-based (Google Places and OSM) sources of geographic user-generated content based on topics identified by the embedding-based BERTopic model. The contribution of the framework lies in the combination of methods selected to improve previous work focused on content-location relationships. Using the city of Lisbon (Portugal) to test our methodology, we first applied the embedding-based topic model to aggregated textual data coming from each source. Results of the analysis evidenced the complexity of content-location relationships, which are mostly based on thematic profiles. Nonetheless, the framework can be employed in other cities and extended with other metrics to enrich research aimed at exploring the correlation between online discourse and geography.
- Conference Article
3
- 10.1109/icdm51629.2021.00200
- Dec 1, 2021
The keyphrase annotation task aims to retrieve the most representative phrases that express the essential gist of documents. In reality, some phrases that best summarize documents are often absent from the original text, which motivates researchers to develop generation methods able to create such phrases. Existing generation approaches usually adopt the encoder-decoder framework for sequence generation. However, the widely used recurrent neural network might fail to capture long-range dependencies among items. In addition, intuitively, as keyphrases are likely to correlate with topical words, some methods propose to introduce topic models into keyphrase generation. But they hardly leverage the global information of topics. In view of this, we employ the Transformer architecture with the pre-trained BERT model as the encoder-decoder framework for keyphrase generation. BERT and Transformer are demonstrated to be effective for many text mining tasks, but they have not been extensively studied for keyphrase generation. Furthermore, we propose a topic attention mechanism to utilize corpus-level topic information globally for keyphrase generation. Specifically, we propose BertTKG, a keyphrase generation method that uses a contextualized neural topic model for corpus-level topic representation learning, and then enhances the document representations learned by the pre-trained language model for better keyphrase decoding. Extensive experiments conducted on three public datasets demonstrate the superiority of BertTKG.
- Research Article
19
- 10.1016/j.neucom.2021.10.047
- Oct 21, 2021
- Neurocomputing
A graph convolutional topic model for short and noisy text streams
- Book Chapter
77
- 10.1007/978-3-319-57529-2_29
- Jan 1, 2017
Inferring topics from the overwhelming amount of short texts becomes a critical but challenging task for many content analysis applications, such as content characterizing, user interest profiling, and emerging topic detection. Existing methods such as probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA) cannot solve this problem very well since only very limited word co-occurrence information is available in short texts. This paper studies how to incorporate external word correlation knowledge into short texts to improve the coherence of topic modeling. Based on recent results in word embeddings that learn semantic representations for words from a large corpus, we introduce a novel method, Embedding-based Topic Model (ETM), to learn latent topics from short texts. ETM not only solves the problem of very limited word co-occurrence information by aggregating short texts into long pseudo-texts, but also utilizes a Markov Random Field regularized model that gives correlated words a better chance to be put into the same topic. Experiments on real-world datasets validate the effectiveness of our model compared with state-of-the-art models.
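The ETM abstract above relies on aggregating short texts into longer pseudo-texts before topic inference. A crude sketch of that idea, assuming greedy merging by vocabulary overlap (ETM's actual aggregation uses embedding similarity, so this is only illustrative):

```python
def jaccard(a, b):
    """Jaccard similarity between the vocabularies of two token lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def aggregate(short_texts, threshold=0.2):
    """Greedily merge short texts into longer pseudo-texts when their
    vocabulary overlap is high enough (a stand-in for ETM's
    embedding-based aggregation)."""
    pseudo = []
    for doc in short_texts:
        for p in pseudo:
            if jaccard(doc, p) >= threshold:
                p.extend(doc)  # grow an existing pseudo-text
                break
        else:
            pseudo.append(list(doc))  # start a new pseudo-text
    return pseudo

texts = [["cheap", "phone", "deal"],
         ["phone", "camera", "review"],
         ["rain", "storm", "warning"]]
print(aggregate(texts))
```

The resulting pseudo-texts have enough word co-occurrence for a standard topic model to work with, which is exactly the sparsity problem the abstract identifies in short texts.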
- Conference Article
33
- 10.18653/v1/2020.emnlp-main.35
- Jan 1, 2020
Abstractive document summarization is a comprehensive task including document understanding and summary generation, an area in which Transformer-based models have achieved state-of-the-art performance. Compared with Transformers, topic models are better at learning explicit document semantics, and hence could be integrated into Transformers to further boost their performance. To this end, we rearrange and explore the semantics learned by a topic model, and then propose a topic assistant (TA) including three modules. TA is compatible with various Transformer-based models and user-friendly since i) TA is a plug-and-play model that does not break any structure of the original Transformer network, making it easy to fine-tune Transformer+TA from a well pre-trained model; ii) TA only introduces a small number of extra parameters. Experimental results on three datasets demonstrate that TA is able to improve the performance of several Transformer-based models.
- Research Article
3
- 10.3390/app10030834
- Jan 24, 2020
- Applied Sciences
The hallmarks of cancer represent an essential concept for discovering novel knowledge about cancer and for extracting the complexity of cancer. Due to the lack of topic analysis frameworks optimized specifically for cancer data, topic modeling studies in cancer research remain challenging. Recently, deep learning (DL) based approaches were successfully employed to learn semantic and contextual information from scientific documents using word embeddings according to the hallmarks of cancer (HoC). However, those approaches are only applicable to labeled data, and comparatively few documents are labeled by experts, while a massive number of unlabeled documents are available online. In this paper, we present a multi-task topic analysis (MTTA) framework to analyze cancer hallmark-specific topics from documents. The MTTA framework consists of three main subtasks: (1) cancer hallmark learning (CHL)—used to learn cancer hallmarks on existing labeled documents; (2) weak label propagation (WLP)—used to classify a large number of unlabeled documents with the model pre-trained in the CHL task; and (3) topic modeling (ToM)—used to discover topics for each hallmark category. In the CHL task, we employed a convolutional neural network (CNN) with pre-trained word embeddings that represent semantic meanings obtained from an unlabeled large corpus. In the ToM task, we employed latent topic models such as latent Dirichlet allocation (LDA) and probabilistic latent semantic analysis (PLSA) to capture the semantic information learned by the CNN model for topic analysis. To evaluate the MTTA framework, we collected a large number of documents related to lung cancer in a case study. We also conducted a comprehensive performance evaluation of the MTTA framework, comparing it with several approaches.
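Several entries in this listing, including the MTTA abstract above, lean on LDA inferred by collapsed Gibbs sampling. A minimal, self-contained sketch of that sampler on a toy corpus (hyperparameters, corpus, and iteration count are invented for illustration; real implementations such as gensim's are far more efficient):

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA over tokenised documents."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})              # vocabulary size
    ndk = [[0] * n_topics for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                                # topic totals
    z = []                                             # assignment per token
    for d, doc in enumerate(docs):                     # random initialisation
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):                             # resample each token
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    # Smoothed, normalised document-topic distributions.
    return [[(c + alpha) / (sum(row) + n_topics * alpha) for c in row] for row in ndk]

docs = [["gene", "dna", "cell"], ["dna", "cell", "protein"],
        ["ball", "goal", "team"], ["team", "goal", "match"]]
theta = lda_gibbs(docs, n_topics=2)
print(theta)
```

Each row of `theta` is a per-document topic distribution; models like GLTM modify exactly the `weights` line to fold in embedding-based relatedness.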
- Research Article
139
- 10.1016/j.is.2022.102131
- Oct 17, 2022
- Information Systems
Topic modeling algorithms and applications: A survey
- Research Article
7
- 10.1016/j.nlp.2023.100044
- Dec 5, 2023
- Natural Language Processing Journal
Benchmarking topic models on scientific articles using BERTeley
- Book Chapter
4
- 10.1007/978-3-030-34518-1_7
- Jan 1, 2019
This article presents an unsupervised approach to the analysis of labor market requirements that solves the problem of discovering latent specializations within broadly defined professions. For instance, for the profession of “programmer”, such specializations could be “CNC programmer”, “mobile developer”, “frontend developer”, and so on. Various statistical methods of text vector representation have been experimentally evaluated: TF-IDF, probabilistic topic modeling, neural language models based on distributional semantics (word2vec, fasttext), and deep contextualized word representations (ELMo and multilingual BERT). Both pre-trained models and models trained on the texts of job vacancies in Russian have been investigated. The experiments were conducted on a dataset provided by online recruitment platforms. Several types of clustering methods have been tested: K-means, Affinity Propagation, BIRCH, agglomerative clustering, and HDBSCAN. When the number of clusters was predetermined (K-means, agglomerative), the best result was achieved by ARTM. However, when the number of clusters was not specified ahead of time, word2vec trained on our job vacancies dataset outperformed the other models. The models trained on our corpora perform much better than pre-trained models, even those with large multilingual vocabularies.
- Conference Article
3
- 10.1109/ijcnn52387.2021.9534093
- Jul 18, 2021
The automatic topic labeling model aims at generating a sound, interpretable, and meaningful topic label that is used to interpret an LDA-style discovered topic, intending to reduce the cognitive load of end-users while browsing or investigating topics. In this study, we first introduce the pre-trained language model BERT to topic labeling tasks. It exploits the contextual embeddings of the pre-trained language model to improve the quality of sentence encoding. To generate a topic label with higher Relevance, Coverage, and Discrimination, we propose a novel summarization neural framework. Specifically, it first exploits paired attention to model the relationship between the candidate sentences and then decides which sentences should be included in the final summarization topic label. Moreover, we expect that high-quality sentence encoding representations can improve our model's performance, so for each discovered topic we train a specific layer to extract the important topic-related features from the sentence embeddings and filter out noise. The experimental results show that our model significantly outperforms state-of-the-art and classic topic labeling models.
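The topic labeling abstract above selects candidate sentences as a topic's label. A crude sketch of candidate scoring, assuming simple top-word coverage rather than the paper's attention-based scorer (all names and data here are invented):

```python
def label_score(sentence_tokens, topic_words):
    """Score a candidate label sentence by its coverage of the topic's
    top words (a stand-in for the paper's attention-based scorer)."""
    return len(set(sentence_tokens) & set(topic_words)) / len(topic_words)

# A toy topic (its top words) and two candidate label sentences.
topic = ["battery", "screen", "phone", "charge"]
candidates = [["the", "phone", "battery", "drains", "fast"],
              ["weather", "was", "nice", "today"]]

best = max(candidates, key=lambda c: label_score(c, topic))
print(best)  # the battery-related sentence
```

The real framework replaces this overlap heuristic with BERT sentence embeddings and paired attention, but the selection structure, scoring candidates and keeping the best, is the same.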