A comparative analysis of Latent Semantic analysis and Latent Dirichlet allocation topic modeling methods using Bible data

Vasantha Kumari Garbhapu

doi:10.17485/ijst/v13i44.1479

Abstract

Objective: To compare topic modeling techniques. The no free lunch theorem states that, under a uniform distribution over search problems, all machine learning algorithms perform equally, so no method can be assumed best for a given data set in advance. Here, we compare Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) to identify the better performer on an English Bible data set, which has not been studied before.

Methods: This comparative study was divided into three levels. In the first level, Bible data were extracted from the sources and preprocessed to remove words and characters that do not help reveal semantic structure or patterns, yielding a meaningful corpus. In the second level, the preprocessed data were converted into a bag of words, and the TF-IDF (Term Frequency-Inverse Document Frequency) statistic was used to assess how relevant a word is to a document in the corpus. In the third level, LSA and LDA were applied to the resulting corpus to study the feasibility of the techniques.

Findings: Based on our evaluation, LDA achieved 60 to 75% superior performance compared to LSA on document similarity within the corpus and on document similarity with an unseen document. Additionally, LDA showed a better coherence score (0.58018) than LSA (0.50395). Word association for any word within the corpus also showed better results with LDA. Some words have different meanings depending on context; for example, in the Bible, "bear" can refer to punishment or to birth. In our study, LDA word associations were closer to human word associations than those of LSA.

Novelty: LDA was found to be the more computationally efficient and interpretable method on an English Bible data set of the New International Version, a data set that had not previously been created.

Keywords: Topic modeling; LSA; LDA; word association; document similarity; Bible data set
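The second-level step described in the abstract (bag of words followed by TF-IDF weighting) and the cosine similarity used for document comparison can be sketched as follows. This is a minimal illustration and not the author's code: the toy documents, the function names `tfidf_vectors` and `cosine_similarity`, and the unsmoothed formula idf = log(N / df) are all assumptions made for the example, and the LSA and LDA steps themselves are not shown.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes tf = count / document length and idf = log(N / df);
    the paper does not state which TF-IDF variant it used.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, already lowercased and tokenized.
docs = [
    "the shepherd leads the flock to water".split(),
    "the shepherd guards the flock at night".split(),
    "the king built a house of cedar".split(),
]
vectors = tfidf_vectors(docs)
```

A term that occurs in every document (here "the") receives a weight of zero under this formula, so the first two documents are similar only through the terms they share beyond it, and the third document has zero similarity to the first.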

Highlights

There are many text mining methods to turn unstructured textual data into actionable information
We compared the performance of Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) models, using cosine similarity and coherence score as the two primary evaluation metrics
For document similarity within the corpus, all documents were classified into four similarity groups (0% to 25%, 26% to 50%, 51% to 75% and 76% to 100%). Documents were chosen from these groups together with their most similar documents, in descending order of similarity; the same documents were then taken from the other method's results, and the differences between the results of the two methods were analyzed
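The four similarity groups named above can be illustrated with a short sketch. It assumes that similarity scores are cosine similarities in the range 0 to 1 and that each upper bound is inclusive; the source does not say how scores between two bands (for example 25.5%) were assigned. The names `similarity_band` and `group_by_band` and the sample scores are hypothetical.

```python
from collections import defaultdict

# Upper bound of each band (inclusive, by assumption) and its label.
BANDS = [
    (0.25, "0% to 25%"),
    (0.50, "26% to 50%"),
    (0.75, "51% to 75%"),
    (1.00, "76% to 100%"),
]


def similarity_band(score):
    """Map a similarity score in [0, 1] to one of the four bands."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"similarity must be in [0, 1], got {score}")
    for upper, label in BANDS:
        if score <= upper:
            return label
    return BANDS[-1][1]  # Unreachable for valid input; kept as a safeguard.


def group_by_band(scores):
    """Group {doc_id: score} pairs by band, most similar first in each band."""
    groups = defaultdict(list)
    for doc_id, score in scores.items():
        groups[similarity_band(score)].append((doc_id, score))
    for members in groups.values():
        members.sort(key=lambda pair: pair[1], reverse=True)
    return dict(groups)


# Hypothetical similarity scores of four documents to a query document.
sample_scores = {"doc_a": 0.10, "doc_b": 0.62, "doc_c": 0.91, "doc_d": 0.70}
grouped = group_by_band(sample_scores)
```

Sorting inside each band mirrors the descending similarity order described in the highlight, so the same document can be looked up in the results of both methods and compared.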

Summary

Introduction

There are many text mining methods to turn unstructured textual data into actionable information. While traditional methods of analyzing texts are limited in their ability to process large amounts of data, some researchers have applied text mining to qualitative research projects. Owing to these research advancements, text mining is now viewed as a viable and efficient qualitative research method in machine learning and natural language processing [1,2,3]. These applications closely follow the paradigm of topic modeling, a common technique in the field of text mining. Topic models make it possible to analyze a set of documents based on the statistics of the words in each, to express what the topics might be and what each document's balance of topics is. A significant and crucial step for the accuracy and storage of the information is quality management and extraction according to the information that is present

Results

Discussion

Conclusion


Journal: Indian Journal of Science and Technology	Publication Date: Nov 20, 2020
Citations: 4	License type: cc-by


Similar Papers

TOPIC MODELING IN COVID-19 VACCINATION REFUSAL CASES USING LATENT DIRICHLET ALLOCATION AND LATENT SEMANTIC ANALYSIS
Ulfah Malihatin S ... Uce Indahyanti
Jurnal Teknik Informatika (Jutif) | VOL. 4
03 Oct 2023

Improve topic modeling algorithms based on Twitter hashtags
Hayder M Alash ... Ghaidaa A Al-Sultany
Journal of Physics: Conference Series | VOL. 1660
01 Nov 2020

Keyword Extraction – Comparison of Latent Dirichlet Allocation and Latent Semantic Analysis
Bhuvaneshwari Kondeti ... Haragopal V V
European Journal of Mathematics and Statistics | VOL. 3
13 Jun 2022

Topic Modelling in Knowledge Management Documents BPS Statistics Indonesia
Muhammad Yunus Hendrawan ... Nucke Widowati Kusumo Projo
Proceedings of The International Conference on Data Science and Official Statistics | VOL. 2021
04 Jan 2022
