Abstract

Objective: The no free lunch theorem states that, under a uniform distribution over search problems, all machine learning algorithms perform equally, so topic modeling techniques have to be compared on the data at hand. Here, we compare Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) to identify the better performer on an English Bible data set, which has not been studied yet.

Methods: This comparative study was divided into three levels. In the first level, Bible data were extracted from the sources and preprocessed to remove words and characters that were not useful for obtaining the semantic structures or necessary patterns, in order to build a meaningful corpus. In the second level, the preprocessed data were converted into a bag of words, and the numerical statistic TF-IDF (Term Frequency-Inverse Document Frequency) was used to assess how relevant a word is to a document in the corpus. In the third level, LSA and LDA were applied to the resulting corpus to study the feasibility of the techniques.

Findings: Based on our evaluation, LDA achieved 60 to 75% superior performance compared to LSA on document similarity within the corpus and on document similarity with an unseen document. Additionally, LDA showed a better coherence score (0.58018) than LSA (0.50395). Moreover, for any word within the corpus, word association showed better results with LDA. Some words have different meanings depending on the context; for example, in the Bible, "bear" can mean punishment or birth. In our study, the LDA word association results were closer to human word associations than those of LSA.

Novelty: LDA was found to be the more computationally efficient and interpretable method for the English Bible data set of the New International Version, a data set that had not previously been created.

Keywords: Topic modeling; LSA; LDA; word association; document similarity; Bible data set
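As an illustration of the second level of the method (bag of words, then TF-IDF weighting) and of the cosine similarity used for evaluation, the following is a minimal sketch using only the Python standard library. It is not the authors' implementation: the tokenizer, the `tf * log(N / df)` weighting variant, the function names, and the three toy sentences (which are not Bible verses) are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only.

    This is a simple stand-in for the preprocessing level; the actual
    cleaning rules used in the study are not specified here.
    """
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    return cleaned.split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting: (term count / document length) * log(N / df).
    """
    bags = [Counter(tokenize(doc)) for doc in docs]  # bag of words per document
    n_docs = len(bags)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for bag in bags for term in bag)
    vectors = []
    for bag in bags:
        total = sum(bag.values())
        vectors.append(
            {term: (count / total) * math.log(n_docs / df[term])
             for term, count in bag.items()}
        )
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, for illustration only.
docs = [
    "The shepherd leads the flock",
    "The flock follows the shepherd",
    "The king builds a temple",
]
vectors = tfidf_vectors(docs)
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
print(sim_01, sim_02)
```

In this toy corpus the first two sentences share the terms "shepherd" and "flock", so they score higher than the first and third, whose only shared term ("the") occurs in every document and therefore gets zero weight. In the study itself, LSA and LDA are applied on top of such a corpus representation; that step is not shown here.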

Highlights

  • There are many text mining methods to turn unstructured textual data into actionable information

  • We compared the performance of Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) models, using two measures, cosine similarity and coherence score, as the primary evaluation metrics

  • For document similarity within the corpus, all documents were classified into four similarity groups: 0% to 25%, 26% to 50%, 51% to 75% and 76% to 100%. Documents were chosen from these groups together with their most similar documents, in descending order of similarity; the same documents were then taken from the other method's results, and the differences between the results of the two methods were analyzed
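The grouping and ranking described in the last highlight can be sketched as follows. This is a hedged illustration, not the authors' code: the function names, the handling of values that fall between the stated group boundaries (for example 25.5%), and the example scores are assumptions.

```python
def similarity_group(score):
    """Map a similarity score in [0, 1] to one of the four percentage groups.

    Assumption: a value between two stated ranges (e.g. 25.5%) is placed
    in the higher group.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("similarity score must lie in [0, 1]")
    percent = score * 100
    if percent <= 25:
        return "0% to 25%"
    if percent <= 50:
        return "26% to 50%"
    if percent <= 75:
        return "51% to 75%"
    return "76% to 100%"


def most_similar(similarities, top_n=3):
    """Return the top_n (document id, score) pairs in descending score order.

    `similarities` maps a document id to its similarity with a query document.
    """
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Hypothetical similarity scores between one query document and four others.
scores = {"doc_a": 0.10, "doc_b": 0.90, "doc_c": 0.30, "doc_d": 0.60}
for doc_id, score in most_similar(scores, top_n=4):
    print(doc_id, score, similarity_group(score))
```

Running the same grouping on the scores produced by each method for the same query document gives a side-by-side view of where the two methods disagree.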


Summary

Introduction

There are many text mining methods to turn unstructured textual data into actionable information. While traditional methods of analyzing texts are limited in processing large amounts of data, some researchers have applied text mining to qualitative research projects. Owing to these research advancements, text mining is viewed as a viable and efficient qualitative research method in machine learning and natural language processing [1,2,3]. These computer applications closely follow the paradigm of a common technique in the field of text mining: topic modeling. Topic models analyze a set of documents based on the statistics of the words in each document, to express what the topics might be and what each document's balance of topics is. A significant and crucial step for the accuracy and storage of the information is quality management and extraction according to the information that is present.

