Abstract

We present an approach to cross-language text retrieval based on the EuroWordNet (EWN) multilingual semantic database. EuroWordNet is a multilingual, WordNet-like database with basic semantic relations between words for several European languages (English, Dutch, Spanish, Italian, German, French, Czech, and Estonian). In addition to the relations in WordNet 1.5, EWN includes domain labels as well as cross-language and cross-part-of-speech relations, which are directly useful for multilingual information retrieval. In our approach, documents in any language covered by EuroWordNet are indexed in a space of language-independent concepts (the EuroWordNet Inter Lingual Index), which turns term weighting and query/document matching into language-independent tasks. We report the results of experiments that measure the potential benefits of the approach and its tolerance to word sense disambiguation errors. In our monolingual experiments, the classical vector space model for text retrieval gives better results (up to 29% better) when WordNet synsets, rather than word forms, are chosen as the indexing space. This result is obtained for a manually disambiguated test collection derived from the SEMCOR annotated corpus. The sensitivity of retrieval performance to (automatic) disambiguation errors is also measured. Our preliminary bilingual experiments, also reported here, show that our approach can appreciably outperform a naive, dictionary-based translation of the query terms into the target language.
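The idea of indexing in a language-independent concept space can be illustrated with a minimal sketch. It is not the authors' system: the lexicon `TOY_LEXICON`, the `ili:` identifiers, and the helpers `to_concepts`, `cosine`, and `rank` are all hypothetical names introduced here, and the sketch assumes each word maps to exactly one concept, so it skips word sense disambiguation entirely. Queries and documents are reduced to bags of concept identifiers and matched with plain cosine similarity.

```python
import math
from collections import Counter

# Hypothetical toy lexicon: (language, word) -> concept identifier.
# The identifiers are invented for illustration; they are not real
# Inter Lingual Index records.
TOY_LEXICON = {
    ("en", "car"): "ili:car",
    ("es", "coche"): "ili:car",
    ("en", "engine"): "ili:engine",
    ("es", "motor"): "ili:engine",
    ("en", "river"): "ili:river",
    ("es", "río"): "ili:river",
}


def to_concepts(text, lang):
    """Map each known word to its concept id; unknown words are dropped."""
    tokens = text.lower().split()
    return Counter(
        TOY_LEXICON[(lang, tok)] for tok in tokens if (lang, tok) in TOY_LEXICON
    )


def cosine(a, b):
    """Cosine similarity between two sparse concept-count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank(query, query_lang, docs):
    """Score every document against the query in concept space, best first."""
    q_vec = to_concepts(query, query_lang)
    scored = [
        (doc_id, cosine(q_vec, to_concepts(text, lang)))
        for doc_id, (lang, text) in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# An English query matched against Spanish documents, with no translation step.
DOCS = {
    "d1": ("es", "el coche tiene un motor"),
    "d2": ("es", "el río"),
}
ranking = rank("car engine", "en", DOCS)
```

Because both sides are mapped to the same identifiers before matching, the weighting and similarity code never needs to know which language a text was written in. A realistic system would also have to choose among several candidate concepts per word, which is where the disambiguation errors discussed in the abstract come in.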
