CLOSING THE VOCABULARY GAP FOR COMPUTING TEXT SIMILARITY AND INFORMATION RETRIEVAL

Christof Müller,Max Mühlhäuser,Iryna Gurevych

doi:10.1142/s1793351x08000452

Abstract

This paper studies the integration of lexical semantic knowledge in two related semantic computing tasks: ad-hoc information retrieval and computing text similarity. For this purpose, we compare the performance of two algorithms: (i) using semantic relatedness, and (ii) using a conventional extended Boolean model [13] with additional query expansion. For the evaluation, we use two different test collections in the German language especially suitable to study the vocabulary gap problem: (i) GIRT [5] for the information retrieval task, and (ii) a collection of descriptions of professions built to evaluate a system for electronic career guidance in the information retrieval and text similarity tasks. We found that integrating lexical semantic knowledge increases the performance for both tasks. On the GIRT corpus, the performance is improved only for short queries. The performance on the collection of professional descriptions is improved, but crucially depends on the accurate preprocessing of the natural language essays employed as topics.

Full Text