Finite State Automata on Multi-Word Units for Efficient Text-Mining

Alberto Postiglione

doi:10.3390/math12040506

Abstract

Text mining is crucial for analyzing unstructured and semi-structured textual documents. This paper introduces a fast and precise text mining method based on a finite automaton to extract knowledge domains. Unlike simple words, multi-word units (such as credit card) are emphasized for their efficiency in identifying specific semantic areas due to their predominantly monosemic nature, their limited number and their distinctiveness. The method focuses on identifying multi-word units within terminological ontologies, where each multi-word unit is associated with a sub-domain of ontology knowledge. The algorithm, designed to handle the challenges posed by very long multi-word units composed of a variable number of simple words, integrates user-selected ontologies into a single finite automaton during a fast pre-processing step. At runtime, the automaton reads input text character by character, efficiently locating multi-word units even if they overlap. This approach is efficient for both short and long documents, requiring no prior training. Ontologies can be updated without additional computational costs. An early system prototype, tested on 100 short and medium-length documents, recognized the knowledge domains for the vast majority of texts (over 90%) analyzed. The authors suggest that this method could be a valuable semantic-based knowledge domain extraction technique in unstructured documents.
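The runtime behaviour described in the abstract (one automaton built from the selected ontologies, a single character-by-character pass, overlapping multi-word units all reported) can be illustrated with a minimal sketch. The abstract does not specify how the automaton is constructed, so the sketch below swaps in the classic Aho-Corasick construction as an assumption; it is not the author's implementation. The toy ontology, the function names `build_automaton` and `find_units`, and the simple word-boundary check are all hypothetical and exist only for illustration.

```python
from collections import Counter, deque


def build_automaton(ontology):
    """Build an Aho-Corasick automaton (an assumed construction, not the paper's).

    ontology: dict mapping a multi-word unit to a sub-domain label (toy data).
    Returns (goto, fail, out) indexed by state number.
    """
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # (unit, domain) pairs recognised on entering each state
    for unit, domain in ontology.items():
        key = unit.lower()
        state = 0
        for ch in key:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append((key, domain))
    # Breadth-first pass: compute failure links and inherit outputs,
    # so that units ending inside a longer unit are still reported.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]
            queue.append(t)
    return goto, fail, out


def find_units(text, automaton):
    """Scan text once, character by character, and return (start, unit, domain) hits."""
    goto, fail, out = automaton
    lowered = text.lower()
    state, hits = 0, []
    for i, ch in enumerate(lowered):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for unit, domain in out[state]:
            start = i - len(unit) + 1
            # Assumed boundary rule: a hit must not start or end inside a word.
            before_ok = start == 0 or not lowered[start - 1].isalnum()
            after_ok = i + 1 == len(lowered) or not lowered[i + 1].isalnum()
            if before_ok and after_ok:
                hits.append((start, unit, domain))
    return hits


# Toy ontology: units and sub-domain labels are invented for this example.
ontology = {
    "credit card": "finance",
    "card reader": "hardware",
    "credit card reader": "hardware",
}
automaton = build_automaton(ontology)
hits = find_units("The credit card reader rejected my credit card.", automaton)
domain_counts = Counter(domain for _, _, domain in hits)
```

Here `hits` contains the overlapping units "credit card", "card reader" and "credit card reader" from the first half of the sentence, and `domain_counts` tallies how often each sub-domain was triggered, which is one simple way a document could be assigned to knowledge domains. Updating the ontology in this sketch means rebuilding the automaton, a step that is linear in the total length of the units.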

Journal: Mathematics
Publication Date: Feb 6, 2024
Citations: 1
License type: CC BY 4.0

Similar Papers

Classification and Knowledge Organization Systems: ontologies and archival classification
Thiago Henrique Bragato Barros ... Daniel Libonati Gomes
01 Jan 2018

When is the Time Ripe for Natural Language Processing for Patent Passage Retrieval?
Linda Andersson ... Mihai Lupu
24 Oct 2016

Computational Inflection of Multi-Word Units
Agata Savary
Linguistic Issues in Language Technology | VOL. 1
01 Jul 2008

Using Text Mining and Natural Language Processing to Support Business Decision: Towards a NooJ Application
Francesca Esposito ... Maddalena Della Volpe
01 Jan 2015
