Text mining of multilingual corpora via computing semantic relatedness

Chung-Hong Lee, Hsin-Chang Yang

doi:10.1109/icsmc.2002.1176326

Abstract

This paper describes a new application of a text-mining algorithm to bilingual text corpora. Most previous approaches to measuring semantic relatedness have been based on edge-counting methods over a semantic network such as WordNet. Such approaches are not well suited to specific domains in which standard lexical knowledge bases are not available. In this work, we propose an alternative way of acquiring semantic relatedness from text corpora by means of a machine learning technique, namely self-organizing maps (SOM). We present a hybrid approach to discovering a concept-based feature map containing word clusters and document clusters from multilingual text collections. Using SOM-based automatic clustering techniques, we have conducted several experiments to uncover associated documents in Chinese-English bilingual parallel corpora and in a hybrid Chinese-English corpus. In essence, this work provides a method for automatic text clustering that resolves some of the language difficulties in concept discovery and categorization from multilingual text corpora.
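As a rough illustration of the kind of SOM-based document clustering the abstract refers to, the following minimal sketch trains a small self-organizing map on bag-of-words vectors and maps each document to its best-matching unit. It is not the authors' implementation: the toy corpus, the 2x2 grid, the linear decay schedules, and the helper names (`bag_of_words`, `best_matching_unit`, `train_som`) are all assumptions made for this example, and the documents are assumed to be already tokenized on whitespace.

```python
import math
import random


def bag_of_words(docs):
    """Build a sorted vocabulary and one term-count vector per document."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    vectors = [[float(doc.split().count(term)) for term in vocab] for doc in docs]
    return vocab, vectors


def best_matching_unit(weights, vector):
    """Return the grid node whose weight vector is closest to `vector`."""
    def sq_dist(node):
        return sum((w - x) ** 2 for w, x in zip(weights[node], vector))
    return min(weights, key=sq_dist)


def train_som(vectors, rows=2, cols=2, epochs=100, lr0=0.5, seed=0):
    """Train a rows x cols SOM; hyperparameters here are illustrative only."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # One randomly initialized weight vector per grid node.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay                  # learning rate shrinks linearly
        sigma = max(sigma0 * decay, 0.5)  # neighborhood radius shrinks too
        for vector in vectors:
            bmu = best_matching_unit(weights, vector)
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))  # Gaussian neighborhood
                for i in range(dim):
                    w[i] += lr * h * (vector[i] - w[i])
    return weights


# Hypothetical toy corpus: two documents per topic, same words in different order.
docs = [
    "stock market trade price",
    "market price stock trade",
    "football match goal team",
    "team goal football match",
]
vocab, vectors = bag_of_words(docs)
som = train_som(vectors)
assignments = [best_matching_unit(som, v) for v in vectors]
for doc, node in zip(docs, assignments):
    print(node, doc)
```

Documents with the same term counts necessarily land on the same map node, so word order does not affect the assignment. Whether distinct topics separate onto different nodes depends on the grid size, initialization, and training schedule, none of which are specified in the abstract.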

Full Text