Hidden data states-based complex terminology extraction from textual web data model

Fethi Fkih, Mohamed Nazih Omri

doi:10.1007/s10489-019-01568-4

Abstract

To comply with the standards of the “semantic web”, which allow data to be shared and reused across applications, web text documents must be modelled at the level of concepts, exploiting available linguistic resources. Extracting semantic tokens makes such semantic modelling of web documents possible. Unfortunately, terminology extraction techniques for unstructured web text still fall short of producing reliable results. Indeed, systems built on classical techniques extract very large numbers of candidate terms and leave the separation of relevant from irrelevant candidates to post-processing. In this paper, we introduce HMM-Extract, a novel model for terminology retrieval based on Markov models. Our model integrates two modules that work in cascade: a module based on a Hidden Markov Model (HMM) for complex term extraction, and a module based on a Markov chain for filtering the terms provided by the HMM. We focus on three main contributions. First, we provide a linguistic and statistical specification of relevant terms. Second, we show that an HMM can be used to extract relevant terms from unstructured textual documents. Finally, we demonstrate the importance of integrating statistical knowledge into a Markov chain and show experimentally its contribution to the field of terminology extraction.
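
The two-module cascade described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the hidden states (B, I, O), the part-of-speech tag set, every probability, the threshold, and all function names below are hypothetical values chosen only to show how an HMM decoding step could feed a Markov chain filter.

```python
import math

# Hypothetical toy parameters: the abstract does not give the model's actual
# states, tag set, or probabilities.
STATES = ("B", "I", "O")  # begin-of-term, inside-term, outside-term

START = {"B": 0.3, "I": 0.0, "O": 0.7}
TRANS = {
    "B": {"B": 0.05, "I": 0.65, "O": 0.30},
    "I": {"B": 0.05, "I": 0.45, "O": 0.50},
    "O": {"B": 0.35, "I": 0.00, "O": 0.65},
}
EMIT = {  # P(POS tag | hidden state)
    "B": {"ADJ": 0.45, "NOUN": 0.50, "DET": 0.01, "VERB": 0.02, "PREP": 0.02},
    "I": {"ADJ": 0.15, "NOUN": 0.70, "DET": 0.01, "VERB": 0.02, "PREP": 0.12},
    "O": {"ADJ": 0.05, "NOUN": 0.10, "DET": 0.30, "VERB": 0.35, "PREP": 0.20},
}

# Hypothetical first-order Markov chain over the POS tags of a candidate term.
CHAIN_START = {"ADJ": 0.4, "NOUN": 0.6}
CHAIN_TRANS = {
    "ADJ": {"NOUN": 0.8, "ADJ": 0.2},
    "NOUN": {"NOUN": 0.6, "PREP": 0.3, "ADJ": 0.1},
    "PREP": {"NOUN": 0.9, "ADJ": 0.1},
}


def _log(p):
    """Logarithm that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(tags):
    """Module 1, decoding: most likely hidden state sequence for POS tags."""
    if not tags:
        return []
    scores = [{s: _log(START[s]) + _log(EMIT[s][tags[0]]) for s in STATES}]
    back = []
    for tag in tags[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[-1][p] + _log(TRANS[p][s]))
            row[s] = scores[-1][prev] + _log(TRANS[prev][s]) + _log(EMIT[s][tag])
            ptr[s] = prev
        scores.append(row)
        back.append(ptr)
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):  # follow back-pointers from the last position
        state = ptr[state]
        path.append(state)
    return path[::-1]


def extract_candidates(tokens, tags):
    """Module 1, output: multi-word spans labelled B I ... I by the HMM."""
    spans, current = [], []
    for i, state in enumerate(viterbi(tags)):
        if state == "B":
            if current:
                spans.append(current)
            current = [i]
        elif state == "I" and current:
            current.append(i)
        else:
            if current:
                spans.append(current)
            current = []
    if current:
        spans.append(current)
    # Keep only complex (multi-word) candidates.
    return [([tokens[i] for i in s], [tags[i] for i in s])
            for s in spans if len(s) > 1]


def chain_score(tags):
    """Average log-probability per tag under the Markov chain."""
    logp = _log(CHAIN_START.get(tags[0], 0.0))
    for a, b in zip(tags, tags[1:]):
        logp += _log(CHAIN_TRANS.get(a, {}).get(b, 0.0))
    return logp / len(tags)


def filter_terms(candidates, threshold=-1.0):
    """Module 2: keep candidates whose tag pattern the chain finds likely."""
    return [" ".join(toks) for toks, tags in candidates
            if chain_score(tags) >= threshold]


tokens = ["the", "hidden", "markov", "model", "extracts", "terms"]
tags = ["DET", "ADJ", "NOUN", "NOUN", "VERB", "NOUN"]
terms = filter_terms(extract_candidates(tokens, tags))
```

In this sketch the HMM works on part-of-speech tags rather than on words, and the Markov chain rejects any candidate whose tag sequence contains a transition it has never seen. In practice both sets of probabilities would be estimated from an annotated corpus rather than written by hand.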

Full Text