Analysing the Impact of Removing Infrequent Terms on Topic Quality in Latent Dirichlet Allocation Models
An initial procedure in text-as-data applications is text preprocessing. One of the typical steps, which can substantially facilitate computations, consists in removing infrequent terms believed to provide limited information about the corpus. Despite the popularity of vocabulary pruning, the literature offers few guidelines on how to implement it. The aim of the paper is to fill this gap by examining the effects of removing infrequent terms on the quality of topics estimated using latent Dirichlet allocation (LDA). The analysis is based on Monte Carlo experiments taking into account different criteria for infrequent term removal and various evaluation metrics. The results indicate that pruning is often beneficial and that the share of vocabulary that might be eliminated can be quite considerable.
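As an illustration of what vocabulary pruning can look like in practice, the following is a minimal sketch that removes terms appearing in fewer than a given number of documents. The document-frequency criterion, the function name `prune_infrequent_terms`, the threshold `min_doc_freq`, and the toy corpus are all assumptions made for this example; the abstract does not say which removal criteria the paper actually compares.

```python
from collections import Counter


def prune_infrequent_terms(docs, min_doc_freq=2):
    """Drop terms that occur in fewer than `min_doc_freq` documents.

    `docs` is a list of tokenised documents (lists of strings).
    Both the criterion and the threshold are illustrative assumptions.
    """
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vocab = {term for term, df in doc_freq.items() if df >= min_doc_freq}
    # Keep the original token order, dropping only the pruned terms.
    pruned = [[term for term in doc if term in vocab] for doc in docs]
    return pruned, vocab


# Hypothetical toy corpus, already tokenised.
docs = [
    ["topic", "model", "corpus", "zeugma"],
    ["topic", "model", "inference"],
    ["corpus", "inference", "topic"],
]
pruned_docs, vocab = prune_infrequent_terms(docs, min_doc_freq=2)

# Share of the original vocabulary that was removed.
full_vocab = {term for doc in docs for term in doc}
removed_share = 1 - len(vocab) / len(full_vocab)
```

The pruned documents could then be passed to any LDA implementation; comparing topic quality with and without pruning is the kind of experiment the abstract describes.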
- Book Chapter
2
- 10.1007/978-981-10-2338-5_49
- Jan 1, 2016
The latent Dirichlet allocation (LDA) model is an unsupervised learning model that has been used in recent years to extract hidden topics from text. In this paper, we propose Activity-topic LDA, a novel model that improves on the traditional LDA model by integrating text category information. On the basis of LDA, the proposed method adds tourism activity information and obtains a probability distribution model of tourism activities. Based on this model, we can identify and discover the themes of tourism activities.
- Conference Article
4
- 10.1109/icip.2015.7351688
- Sep 1, 2015
Recently, bag-of-visual-words (BoW) models have been widely studied in computer vision. Because BoW models consider only the distributions of visual words in images, the Latent Dirichlet Allocation (LDA) model has drawn attention as a way to discover the structure of visual word distributions over latent topics, which can represent semantic objects in images. To reflect the spatial information of images, the LDA model has been extended to a so-called spatial LDA model for image segmentation, which is not applicable to image classification. Therefore, in this paper, we propose a spatial class LDA (scLDA) model for image classification, in which the topic distributions over visual words are found per image segment and a class-specific-simplex LDA (cssLDA) model is applied for image classification. In our experiments, the proposed scLDA model outperforms previous LDA models in terms of correct classification rates.
- Research Article
- 10.3724/sp.j.1087.2010.03401
- Jan 7, 2011
- Journal of Computer Applications
Because existing Web document annotation techniques are weak in annotation completeness, the Latent Dirichlet Allocation (LDA) model was applied to semantic annotation. By embedding document domain information into the LDA model, a new model called domain-enabled LDA was introduced. An association between the statistical topic model and a domain ontology was established, so that the implicit topics generated could be interpreted by concepts and the explicit semantics of a document could be acquired. Because the LDA model assigns a topic to each word in a document, a multi-granularity annotation strategy was proposed. Experiments on 20news-group and WebKB show that the proposed domain-enabled LDA model can improve annotation effectiveness and that the multi-granularity annotation method helps different types of queries in information retrieval.
- Research Article
1
- 10.1080/2150704x.2019.1706006
- Dec 26, 2019
- Remote Sensing Letters
Light detection and ranging is an important method for acquiring digital surface models and can be used to extract building data. Point-cloud detection of buildings is a prerequisite for the model-based expression of buildings. Existing methods are insufficient because of their abstractness of feature extraction and poor accuracy of the detection results. This paper proposes a method for the point-cloud detection of buildings based on a latent Dirichlet allocation (LDA) model with waveform data. This method can extract waveform data via the global convergence Levenberg-Marquardt algorithm, convert discrete point clouds into point-cluster objects via super voxel segmentation, and detect the point clouds of buildings via the LDA model. Moreover, it supports vector machine classification. Experimental results demonstrate that waveform features and the LDA model both improve the accuracy of building detection. In addition, this method is less susceptible to variations in feature dimensions and is robust in terms of the number of topics and words.
- Research Article
2
- 10.3758/s13428-025-02696-1
- Jan 1, 2025
- Behavior Research Methods
In the semantic variant of verbal fluency tests (VFTs), clustering analysis has become popular for examining the semantic structure. While the computational psycholinguistics approach has recently drawn attention to increasing the reproducibility of clustering analysis, such an approach is not available in all languages. To make the computational approach available in the Japanese language, we constructed a Japanese latent Dirichlet allocation (LDA) model. Our LDA model enables researchers and clinicians to objectively quantify the associative relationships of words, thereby making it possible to automatically detect semantic clusters. We conducted the semantic VFT with healthy young Japanese adults to examine the validity of our LDA model. We performed clustering analyses using the computational approach with our LDA model and the conventional manual approach with human coders. The results showed that the LDA model identified semantic clusters, as did the human coders. In addition, we demonstrated for the first time that response intervals within a cluster were significantly shorter than those outside of clusters, regardless of the clustering approaches. This indicates that both approaches reflect a broadly accepted assumption that closer semantic relations require less processing time. However, LDA-based clustering produced, on average, larger clusters than human-based clustering did, indicating that the LDA model captured semantic relationships between words that human coders would not recognize. Taken together, the present results demonstrated the validity of our LDA model. We hope that our LDA model fosters the use of the computational linguistic approach in semantic VFTs with Japanese participants.
- Research Article
- 10.4314/swj.v19i4.8
- Feb 14, 2025
- Science World Journal
The study investigated the depth of machine learning's capacity to perform prediction tasks. The study used textual data, specifically the daily actions of cryptocurrency (Bitcoin) dealers, which were found in news articles. The data were used because news articles capture crowd knowledge of trading that affects the market price trend. To make predictions, 4073 pre-processed, scraped news articles from the market section of CNBC's website were analysed using the Latent Dirichlet Allocation (LDA) model and its variation, the Supervised Latent Dirichlet Allocation (sLDA) model. The document-term matrix and values of "k" ranging from 3 to 200 were used to train and test the models. The study used four evaluation metrics because of its multinomial classification method: mean absolute percentage error (MAPE), mean absolute error (MAE), root mean square error (RMSE), and R². The outcome demonstrated that, for label prediction on unlabeled new documents, the sLDA model performed better than the LDA model combined with a classification or regression model. The response variable, which was tagged "users' or traders' interest," was the daily closing price of each corresponding document.
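The four evaluation metrics named above have standard definitions, sketched below in plain Python. The function name `regression_metrics` and the example values are hypothetical and are not taken from the study; the sketch also assumes that no true value is zero (required by MAPE) and that the true values are not all identical (required by R²).

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, MAPE (in percent) and R^2 for paired observations.

    Assumes no element of `y_true` is zero and that `y_true` is not constant.
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "R2": r2}


# Hypothetical closing prices and predictions, for illustration only.
metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
```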
- Research Article
43
- 10.1186/1471-2105-7-250
- May 8, 2006
- BMC Bioinformatics
Background: The statistical modeling of biomedical corpora could yield integrated, coarse-to-fine views of biological phenomena that complement discoveries made from analysis of molecular sequence and profiling data. Here, the potential of such modeling is demonstrated by examining the 5,225 free-text items in the Caenorhabditis Genetic Center (CGC) Bibliography using techniques from statistical information retrieval. Items in the CGC biomedical text corpus were modeled using the Latent Dirichlet Allocation (LDA) model. LDA is a hierarchical Bayesian model which represents a document as a random mixture over latent topics; each topic is characterized by a distribution over words. Results: An LDA model estimated from CGC items had better predictive performance than two standard models (unigram and mixture of unigrams) trained using the same data. To illustrate the practical utility of LDA models of biomedical corpora, a trained CGC LDA model was used for a retrospective study of nematode genes known to be associated with life span modification. Corpus-, document-, and word-level LDA parameters were combined with terms from the Gene Ontology to enhance the explanatory value of the CGC LDA model, and to suggest additional candidates for age-related genes. A novel, pairwise document similarity measure based on the posterior distribution on the topic simplex was formulated and used to search the CGC database for "homologs" of a "query" document discussing the life span-modifying clk-2 gene. Inspection of these document homologs enabled and facilitated the production of hypotheses about the function and role of clk-2. Conclusion: Like other graphical models for genetic, genomic and other types of biological data, LDA provides a method for extracting unanticipated insights and generating predictions amenable to subsequent experimental validation.
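The description of LDA as a model in which each document is a random mixture over topics, and each topic is a distribution over words, can be made concrete with a small sketch of the generative process. The vocabulary, the two topic-word distributions, and the Dirichlet parameter `alpha` below are invented for illustration and do not come from the CGC corpus; only the Python standard library is used.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, vocab, alpha, length, rng):
    """Generate one document under the LDA generative process.

    `topics[k][v]` is the probability of word `vocab[v]` under topic k.
    """
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    words = []
    for _ in range(length):
        # Choose a topic for this word position, then a word from that topic.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        w = rng.choices(vocab, weights=topics[z])[0]
        words.append(w)
    return theta, words


rng = random.Random(0)
# Hypothetical vocabulary and topic-word distributions.
vocab = ["gene", "lifespan", "neuron", "behavior"]
topics = [
    [0.5, 0.5, 0.0, 0.0],  # a topic concentrated on "gene" and "lifespan"
    [0.0, 0.0, 0.5, 0.5],  # a topic concentrated on "neuron" and "behavior"
]
theta, words = generate_document(topics, vocab, alpha=[0.5, 0.5], length=8, rng=rng)
```

Fitting an LDA model runs this process in reverse: given only the observed words, it estimates the topic-word distributions and each document's mixture `theta`.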
- Research Article
- 10.1177/14727978241299236
- Nov 5, 2024
- Journal of Computational Methods in Sciences and Engineering
The study aims to investigate the integration of artificial intelligence technology into Weibo sentiment analysis, aiming to enhance the effectiveness of Weibo in human-computer interaction education. Initially, the Weibo sentiment dictionary is created, and a conventional model for sentiment analysis of user-forwarded Weibo is introduced, specifically the Latent Dirichlet Allocation (LDA) model. Then, the deep learning models in the field of artificial intelligence, namely, the convolutional neural network (CNN) model and the long short-term memory network (LSTM) model, are proposed. The Tencent Weibo data set is obtained through an Application Programming Interface (API) crawler. The experimental environment of the deep learning model is analyzed and the data set is preprocessed. The results show that when the number of topics is 120, the relative maximum value of F1 is 69.92% and 69.96% with and without the introduction of emotional features in the LDA model, respectively. The accuracies of the CNN and LSTM models are 0.793 and 0.849, respectively. In the three cases of user characteristics, user characteristics + Weibo features, and user characteristics + Weibo features + relationship characteristics, the polarity of the forwarded comments of the LDA model does not change much. In conclusion, the LDA model demonstrates universality and accuracy in sentiment analysis of user-forwarded Weibo, while LSTM proves to be more suitable for sentiment classification in this context. Leveraging the LDA deep learning model, LSTM effectively analyzes the sentiment of users forwarding Weibo. These findings serve as an experimental foundation for the efficient integration of Weibo in human-computer interaction education.
- Book Chapter
2
- 10.1007/978-3-319-18117-2_44
- Jan 1, 2015
Speech analytics suffer from poor automatic transcription quality. To tackle this difficulty, one solution consists in mapping transcriptions into a space of hidden topics. This abstract representation makes it possible to work around drawbacks of the ASR process. A well-known and commonly used one is the topic-based representation from Latent Dirichlet Allocation (LDA). During the LDA learning process, the distribution of words in each topic is estimated automatically. Nonetheless, in the context of a classification task, the LDA model does not take into account the targeted classes. The supervised Latent Dirichlet Allocation (sLDA) model overcomes this weakness by considering the class, as a response, as well as the document content itself. In this paper, we propose to compare these two classical topic-based representations of a dialogue (LDA and sLDA) with a new one based not only on the dialogue content itself (words), but also on the theme related to the dialogue. This original Author-topic Latent Variables (ATLV) representation is based on the Author-topic (AT) model. The effectiveness of the proposed ATLV representation is evaluated on a classification task from automatic dialogue transcriptions of the Paris Transportation customer service call. Experiments confirmed that this ATLV approach outperforms the LDA and sLDA approaches by a wide margin, with substantial gains of 7.3 and 5.8 points, respectively, in terms of correctly labeled conversations.
- Research Article
1
- 10.1108/jm2-07-2021-0163
- Sep 29, 2021
- Journal of Modelling in Management
Purpose: Obtaining consumers’ product evaluations from massive numbers of online reviews has long been a topic of interest for online retailers. In online shopping, there is no face-to-face interaction between online retailers and customers. After collecting online reviews left by customers, online retailers are eager to answer questions such as: which product attributes attract consumers, and which step of the shopping process gives consumers a better experience? This paper aims to associate the latent Dirichlet allocation (LDA) model with consumers’ attitudes and provides a method to calculate a numerical measure of the product evaluation expressed in each word. Design/methodology/approach: First, all possible pairs of reviews are organized as documents to build the corpus. After that, the latent topics of the traditional LDA model, referred to as the standard LDA model, are separated into shared and differential topics. Then, the authors associate the model with consumers’ attitudes toward each review, which is classified as either positive or non-positive. The product evaluation reflected in consumers’ binary attitudes is extended to each word that appears in the corpus. Finally, variational optimization is introduced to estimate the parameters of the expanded LDA model. Findings: The experimental results illustrate that the LDA model in this research, referred to as the expanded LDA model, can successfully assign sufficient probability to words related to product attributes or consumers’ product evaluations. Compared with the standard LDA model, the expanded model tends to assign higher probability to words that have a higher ranking within each topic. In addition, the expanded model has higher precision on the prediction set, which shows that breaking the topics down into two categories fits the data set better than the standard LDA model. The product evaluation of each word is calculated by the expanded model and depicted at the end of the experiment. Originality/value: This research provides a new method to calculate consumers’ product evaluations from reviews at the level of words. Words may be used to describe product attributes or consumers’ experiences in reviews. Assigning numerical measures to words makes it possible to analyze consumers’ product evaluations quantitatively. In addition, since words themselves are labels, they can also be ranked if a numerical measure is given. Online retailers can benefit from the results for label choosing, advertising or product recommendation.
- Conference Article
28
- 10.1109/icassp.2007.367158
- Apr 1, 2007
We propose a Latent Dirichlet-Tree Allocation (LDTA) model, a correlated latent semantic model, for unsupervised language model adaptation. The LDTA model extends the Latent Dirichlet Allocation (LDA) model by replacing a Dirichlet prior with a Dirichlet-Tree prior over the topic proportions. Latent topics under the same subtree are expected to be more correlated than topics under different subtrees. The LDTA model falls back to the LDA model when a depth-one Dirichlet-Tree is used, and the model fits into the variational Bayes inference framework employed in the LDA model. Empirical results show that the LDTA model has faster training convergence than the LDA model with the same initial flat model. Experimental results show that the LDTA-adapted LM performed better than the LDA-adapted LM on the Mandarin RT04-eval set when the models were trained using a small text corpus, while both models had the same recognition performance when the models were trained using a big text corpus. We observed a 0.4% absolute CER reduction after LM adaptation using LSA marginals. Index Terms: correlated topics, Dirichlet-Tree, LSA, unsupervised LM adaptation
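To make the idea of a Dirichlet-Tree prior over topic proportions more tangible, here is a minimal sketch of sampling from such a prior: each internal node draws Dirichlet-distributed branch probabilities, and a topic's proportion is the product of the branch probabilities along its path from the root. The tree encoding, the topic names, and the concentration values are assumptions made for this example; the sketch illustrates the prior only, not the paper's variational inference or language model adaptation.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_dirichlet_tree(node, rng, mass=1.0, out=None):
    """Sample topic proportions from a Dirichlet-Tree prior.

    A node is either a leaf (a topic name, str) or a list of
    (concentration, child) pairs. This encoding is an illustrative assumption.
    """
    if out is None:
        out = {}
    if isinstance(node, str):
        out[node] = mass  # leaf: product of branch probabilities on its path
        return out
    branch_probs = sample_dirichlet([a for a, _ in node], rng)
    for p, (_, child) in zip(branch_probs, node):
        sample_dirichlet_tree(child, rng, mass * p, out)
    return out


rng = random.Random(1)

# Two subtrees: topics under the same subtree share a branch probability.
tree = [
    (1.0, [(1.0, "topic_a"), (1.0, "topic_b")]),
    (1.0, [(1.0, "topic_c"), (1.0, "topic_d")]),
]
tree_props = sample_dirichlet_tree(tree, rng)

# A depth-one tree is an ordinary Dirichlet prior, as in standard LDA.
flat = [(1.0, "topic_a"), (1.0, "topic_b"), (1.0, "topic_c"), (1.0, "topic_d")]
flat_props = sample_dirichlet_tree(flat, rng)
```

In the two-level tree, `topic_a` and `topic_b` both scale with the probability mass assigned to their shared parent, which is one way topics under the same subtree become correlated.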
- Research Article
2
- 10.1007/s00520-024-08513-3
- Apr 29, 2024
- Supportive care in cancer : official journal of the Multinational Association of Supportive Care in Cancer
This study aimed to assess the different needs of patients with breast cancer and their families in online health communities at different treatment phases using a Latent Dirichlet Allocation (LDA) model. Using Python, breast cancer-related posts were collected from two online health communities: patient-to-patient and patient-to-doctor. After data cleaning, eligible posts were categorized based on the treatment phase. Subsequently, an LDA model identifying the distinct need-related topics for each phase of treatment, including data preprocessing and LDA topic modeling, was established. Additionally, the demographic and interactive features of the posts were manually analyzed. We collected 84,043 posts, of which 9504 posts were included after data cleaning. Early diagnosis and rehabilitation treatment phases had the highest and lowest number of posts, respectively. LDA identified 11 topics: three in the initial diagnosis phase and two in each of the remaining treatment phases. The topics included disease outcomes, diagnosis analysis, treatment information, and emotional support in the initial diagnosis phase; surgical options and outcomes, postoperative care, and treatment planning in the perioperative treatment phase; treatment options and costs, side effects management, and disease prognosis assessment in the non-operative treatment phase; diagnosis and treatment options, disease prognosis, and emotional support in the relapse and metastasis treatment phase; and follow-up and recurrence concerns, physical symptoms, and lifestyle adjustments in the rehabilitation treatment phase. The needs of patients with breast cancer and their families differ across various phases of cancer therapy. Therefore, specific information or emotional assistance should be tailored to each phase of treatment based on the unique needs of patients and their families.
- Research Article
4
- 10.14257/ijdta.2016.9.7.06
- Jul 31, 2016
- International Journal of Database Theory and Application
A hidden topic model for Chinese text, which possesses complicated semantics, is urgently needed, since China has occupied an increasingly significant role in the booming development of globalization over recent years. This paper details the basic process of extracting latent Chinese topics by demonstrating a Chinese topic extraction schema based on the Latent Dirichlet Allocation (LDA) model. Furthermore, the schema was applied to CCL, an authoritative Chinese corpus, to extract topics for its nine categories. Rigorous empirical analysis shows that the LDA results have a considerably higher average precision rate than three other comparable Chinese topic extraction techniques; however, the average recall rate is worse than that of KNN and almost the same as that of the PLSI model. Moreover, the recall rate and precision rate of LDA-CH are worse than those of LDA-EH. Therefore, the LDA model should be improved to adapt to the distinctive features of Chinese words in order to make it better suited to Chinese topic extraction.
- Research Article
- 10.52783/jes.4218
- Jun 1, 2024
- Journal of Electrical Systems
Background: Corporate Social Responsibility (CSR) in supply chain management requires understanding customer psychological anxiety attributes. A data-driven approach, such as using a Latent Dirichlet Allocation (LDA) model, can provide insights. By recognizing and addressing customer psychological anxiety, CSR can provide better support, which leads to higher customer satisfaction. When customers feel understood and supported, they are more likely to have a positive perception of the company and its supply chain services. Subjects and methods: The study aims to use the LDA method to explore consumer psychological anxiety and its attributes for CSR in supply chain management. The corpus was collected from the Web of Science Core Collection with the keywords “CSR” and “supply chain management” and comprises 965 articles related to the field from 1990 to 2022. LDA is a natural language processing technique that uncovers thematic structures in textual data. By applying LDA to customer feedback, businesses can identify anxiety attributes. Steps include data collection, preprocessing, LDA model training, topic interpretation, and deriving business insights. Results: The study ran LDA in Python. After the text data were input, 11 topics were identified according to the topic coherence value. The topics were then interpreted from the most representative words or phrases within each topic, combined with CSR and supply chain knowledge, and fitted to the service quality model to find the customer psychological anxiety attributes. The results reveal that customers tend to be anxious about the reliability and assurance aspects of CSR in supply chain management and are more concerned about environmental and social responsibility. Conclusions: The study revealed customer psychological anxiety about CSR initiatives and strategies within supply chain management. The results show that consumers feel the most anxiety about the reliability and assurance aspects of supply chain management. They are concerned about the environmental and social responsibility that supply chain enterprises take on, especially in the food supply chain field. By addressing these concerns, organizations can enhance customer satisfaction, build stronger relationships, and improve their overall CSR performance. The findings of this study also contribute to the field by providing valuable guidance. Further research could add other text mining methods such as structural topic modelling, conduct deeper quantitative research in the field, and develop a new customer-CSR service quality model for the service industry.
- Research Article
33
- 10.1080/09537325.2022.2130039
- Oct 4, 2022
- Technology Analysis & Strategic Management
Standard-essential patents (SEPs) are an important technological resource for firms in the telecommunication industry. The utilisation of technological topic analysis to reveal the global development dynamics of SEPs has significant theoretical and practical implications. First, this study defines the phrase extraction rules and constructs a phrase importance evaluation model to extract key technical phrases in the patent text. Second, the extracted key phrases are used as input for the Latent Dirichlet Allocation (LDA) model, and the relative independence (RI) model is proposed to determine the optimal number of topics based on two dimensions of coherence and similarity. Finally, the technological topic analysis based on the improved LDA model is performed on 30,154 texts of declared 5G SEPs. The results show that (1) the RI model can better identify the optimal number of topics for the LDA model; (2) 23 key technologies and four hot spots in 5G are identified based on the improved LDA model; (3) different firms have different technological layouts, and the diversification trend of technology development appears; and (4) the forecasting results also reveal the dynamics of emerging and declining technical areas in the 5G industry.