Navigable Topic Maps for overlaying multiple acquired semantic classifications

Helka Folch, Saadi Lahlou, Benoît Habert

doi:10.1162/109966200750363625

Abstract

We present work carried out within the framework of the Scriptorium project, developed at the Research & Development division of Électricité de France (EDF), the French electricity company. We are exploring issues related to knowledge acquisition from very large, heterogeneous corpora, and to the semantic annotation of these corpora, with the aim of facilitating browsing and navigation. Semantic access to heterogeneous, evolving text collections has become a crucial issue in the world of online information: the increasing availability of electronic text enables the construction (and dispersion) of heterogeneous text collections. Current navigation tools such as thesauri, glossaries, and indexes, which are based on pre-defined semantic categories or taxonomies, are inadequate for describing or browsing this kind of dynamic, loosely structured text collection. We have therefore adopted an inductive, data-driven approach aimed at extracting semantic classes from a corpus through the statistical analysis of textual data. We create different views or 'slices' of the document collection by extracting sub-corpora of manageable size, which we submit to the statistical software. We then build a navigable topic map of our document collection using the Topic Map Standard (ISO/IEC 13250); the map provides a semantic interface to the document collection and enables navigation through the inductively acquired viewpoints and classes. Navigation is aided by a 3D geometric representation of the semantic space of the corpus... The aim of this project is to identify prominent and emerging topics through the automatic analysis of the discourse of the company's (EDF's) different social agents (managers, trade unions, employees, etc.) using textual data analysis methods.
The corpus under study in this project contains eight million words and is very heterogeneous (it includes book extracts, corporate press, union press, summaries of corporate meetings, transcriptions of taped trade union messages, etc.). This diversity makes the corpus prototypical of the electronic documents available nowadays in a given domain. All documents are SGML-tagged following the TEI (Text Encoding Initiative) recommendations... We are exploring issues related to semantic acquisition from large, heterogeneous corpora and content-based access to these corpora on the basis of inductively acquired categories. We feel that data-driven, inductive approaches for building semantic interfaces to text collections will become increasingly necessary for efficiently managing the unrestricted, dynamic online information available today.
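As a rough illustration of the idea of overlaying several acquired classifications, the sketch below stores classes from different views of a corpus in one index and looks up which classes share documents. It is only a minimal in-memory model under stated assumptions: the names `Topic`, `TopicMap`, `add_occurrence`, `topics_for`, and `related` are hypothetical, the sample viewpoints and class labels are invented, and the structure is not the interchange syntax of ISO/IEC 13250 nor the Scriptorium implementation.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Topic:
    """A semantic class acquired from one view ('slice') of the corpus."""
    name: str
    viewpoint: str  # hypothetical label for the sub-corpus that produced the class
    occurrences: set = field(default_factory=set)  # ids of documents in the class


class TopicMap:
    """Hypothetical index overlaying classes from several viewpoints."""

    def __init__(self):
        self.topics = {}                     # (viewpoint, name) -> Topic
        self.by_document = defaultdict(set)  # document id -> {(viewpoint, name)}

    def add_occurrence(self, viewpoint, name, doc_id):
        """Record that doc_id belongs to class `name` under `viewpoint`."""
        key = (viewpoint, name)
        topic = self.topics.setdefault(key, Topic(name, viewpoint))
        topic.occurrences.add(doc_id)
        self.by_document[doc_id].add(key)

    def topics_for(self, doc_id):
        """All classes, from every viewpoint, that contain the document."""
        return sorted(self.by_document.get(doc_id, set()))

    def related(self, viewpoint, name):
        """Classes from any viewpoint sharing at least one document."""
        key = (viewpoint, name)
        found = set()
        for doc_id in self.topics[key].occurrences:
            found |= self.by_document[doc_id]
        found.discard(key)  # a class is not related to itself
        return sorted(found)


# Invented sample data, for illustration only.
tm = TopicMap()
tm.add_occurrence("union_press", "working_time", "doc1")
tm.add_occurrence("corporate_press", "organisation", "doc1")
tm.add_occurrence("corporate_press", "safety", "doc2")
print(tm.related("union_press", "working_time"))
```

Keying each class by the pair (viewpoint, name) keeps classes with the same label from different slices distinct, which is what lets a reader move from a document to every classification it appears in.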

Journal: Markup Languages: Theory and Practice	Publication Date: Aug 1, 2000
Citations: 2

