Text mining of bilingual parallel corpora with a measure of semantic similarity

Chung-Hong Lee, Hsin-Chang Yang

doi:10.1109/icsmc.2001.969857

Abstract

This paper describes a new application of a text-mining algorithm to bilingual parallel corpora. The ultimate goal, pursued in the context of a Chinese-English machine translation project, is to develop a language-neutral method for discovering similar documents in multilingual text collections. Using a variant of automatic clustering based on a neural network approach, namely the self-organizing map (SOM), we conducted several experiments to uncover associated documents in Chinese-English bilingual parallel corpora and in a hybrid Chinese-English corpus. The experiments yield interesting results and point to a couple of directions for future work in multilingual information discovery. In addition, to explore how this machine learning approach bears on the linguistic problem of mining meaningful linguistic elements from multilingual texts, we examined the resulting term associations and text associations from the viewpoint of cross-lingual text similarity. To evaluate the semantic relatedness of the mined bilingual texts, we applied a semantic similarity measure to the resulting bilingual document clusters and word clusters. The paper presents SOM-based algorithms for multilingual text mining that automatically group similar multilingual texts (i.e., Chinese and English texts), along with a means of measuring their semantic similarity, in order to address the syntactic and semantic ambiguity that complicates multilingual information access.
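As a rough illustration of the kind of SOM-based grouping the abstract describes, the following minimal sketch trains a small self-organizing map on toy term-frequency vectors. It is not the authors' implementation: the mixed Chinese-English vocabulary, the toy documents, the 3x3 map size, the linear decay schedules, and the use of cosine similarity as a stand-in for the paper's semantic similarity measure are all assumptions made for this example.

```python
import numpy as np

# Hypothetical mixed vocabulary: four finance terms, then four sports terms.
VOCAB = ["market", "市場", "stock", "股票", "football", "足球", "goal", "進球"]

# Toy term-frequency vectors (assumed data, one row per document).
docs = np.array([
    [2, 0, 1, 0, 0, 0, 0, 0],  # English finance text
    [0, 2, 0, 1, 0, 0, 0, 0],  # Chinese finance text
    [1, 1, 1, 1, 0, 0, 0, 0],  # hybrid finance text
    [0, 0, 0, 0, 2, 0, 1, 0],  # English sports text
    [0, 0, 0, 0, 0, 2, 0, 1],  # Chinese sports text
    [0, 0, 0, 0, 1, 1, 1, 1],  # hybrid sports text
], dtype=float)


def train_som(data, rows=3, cols=3, epochs=50, lr0=0.5, sigma0=1.5, seed=0):
    """Train a rows x cols self-organizing map on the row vectors of `data`."""
    rng = np.random.default_rng(seed)
    n_samples, dim = data.shape
    weights = rng.random((rows, cols, dim))
    # Grid coordinates of every map unit, shape (rows, cols, 2).
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    total_steps = epochs * n_samples
    step = 0
    for _ in range(epochs):
        for idx in rng.permutation(n_samples):
            x = data[idx]
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                 # learning rate decays linearly
            sigma = sigma0 * (1.0 - frac) + 1e-3    # neighbourhood radius shrinks
            # Best-matching unit: the map unit whose weight vector is closest to x.
            dists = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)
            # Gaussian neighbourhood centred on the best-matching unit.
            grid_dist2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            # Pull every unit towards x, weighted by its neighbourhood value.
            weights += lr * h[..., None] * (x - weights)
            step += 1
    return weights


def best_matching_unit(weights, x):
    """Return the (row, col) of the map unit closest to vector `x`."""
    dists = np.linalg.norm(weights - x, axis=-1)
    return tuple(int(i) for i in np.unravel_index(np.argmin(dists), dists.shape))


def cosine_similarity(a, b):
    """Cosine similarity, used here only as a stand-in similarity measure."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


weights = train_som(docs)
# Documents that land on the same or neighbouring units form a cluster.
assignments = [best_matching_unit(weights, d) for d in docs]
```

Note that in this toy setup only the hybrid documents share terms across the two languages; purely English and purely Chinese vectors have no overlapping dimensions, so any cross-lingual grouping depends on such shared or aligned features being present in the input representation.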

Full Text