Optimizing Text Clustering Efficiency through Flexible Latent Dirichlet Allocation Method: Exploring the Impact of Data Features and Threshold Modification

Erzsébet Tóth, Zoltán Gál

doi:10.36244/icj.2024.5.7

Abstract

A parallel corpus of Croatian EU legislative documents, automatically translated into English, spans 28 years and is enriched with metadata, including the creation year and hierarchical classifier tags denoting descriptors, document types, and fields. However, nearly two-thirds of the approximately 1,500 texts lack complete metadata, and completing it requires labor-intensive manual work that is difficult for human administrators to sustain. The same incompleteness can be observed on official legal sites that operate as regular service-provisioning databases. In response, this paper introduces an artificial cognitive, multilabel classification approach that expedites the tagging process with only a fraction of the manual effort. Leveraging the Latent Dirichlet Allocation (LDA) algorithm, our method assigns field values or tags to incompletely labeled documents. We implement a Flexible LDA variant that incorporates the influence of topics close to the most probable topic, regulated by a relative probability threshold (RPT). We evaluate how the LDA prediction depends on document prefiltering and on the RPT value. Furthermore, we investigate how quantitative linguistic properties depend on the type and speciality of the pre-processing tasks. Our algorithm, built on error-correcting optimizing codes, successfully predicts a mixture of topic probabilities for these legal texts. This prediction is achieved by calculating the Hamming distance of binary feature vectors created using the legal fields of the EUROVOC multilingual thesaurus.
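The abstract does not spell out how the RPT is applied, so the following is only a minimal sketch under stated assumptions: a topic is kept when its probability is at least RPT times the probability of the most probable topic, and each topic is assumed to map one-to-one onto a legal field so that the selection can be written as a binary feature vector. The function names (`select_topics`, `to_binary_vector`, `hamming_distance`) and the sample probabilities are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def select_topics(topic_probs: Sequence[float], rpt: float) -> List[int]:
    """Keep topics whose probability is at least rpt * (highest probability).

    Assumption: this is one plausible reading of a "relative probability
    threshold"; the paper's exact rule may differ.
    """
    if not 0.0 < rpt <= 1.0:
        raise ValueError("rpt must be in the interval (0, 1]")
    top = max(topic_probs)
    return [i for i, p in enumerate(topic_probs) if p >= rpt * top]


def to_binary_vector(selected: Sequence[int], n_fields: int) -> List[int]:
    """Encode the selected indices as a 0/1 vector of length n_fields.

    Assumption: topic i corresponds to legal field i.
    """
    chosen = set(selected)
    return [1 if i in chosen else 0 for i in range(n_fields)]


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical topic mixture for one document (made-up numbers).
topic_probs = [0.50, 0.30, 0.15, 0.05]

# With rpt = 0.5 the cutoff is 0.25, so topics 0 and 1 are kept.
predicted = to_binary_vector(select_topics(topic_probs, rpt=0.5), len(topic_probs))

# Hypothetical reference labels for the same document.
reference = [1, 0, 1, 0]

distance = hamming_distance(predicted, reference)
print(predicted, distance)
```

Setting `rpt=1.0` in this sketch reduces the selection to the single most probable topic, while lower values admit more neighbouring topics; the Hamming distance then counts how many field labels differ between the predicted and reference vectors.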

More From: Infocommunications journal

Similar Papers

Comparative Study on Perceived Trust of Topic Modeling Based on Affective Level of Educational Text
Youngjae Im ... Kijung Park
Applied Sciences | VOL. 9
28 Oct 2019

Indonesia's News Topic Discussion about Covid-19 Outbreak using Latent Dirichlet Allocation
Razief Perucha Fauzie Afidh ... Zainal A Hasibuan
03 Nov 2020

CitationLDA++
Thuc Nguyen ... Phuc Do
01 Jan 2018

An effective hot topic detection method for microblog on spark
Wei Ai ... Keqin Li
Applied Soft Computing | VOL. 70
07 Oct 2017
