Abstract

Today's amount of user-generated, multilingual textual data generates the necessity for information processing systems, where cross-linguality, i.e the ability to work on more than one language, is fully integrated into the underlying models. In the particular context of Information Retrieval (IR), this amounts to rank and retrieve relevant documents from a large repository in language A, given a user's information need expressed in a query in language B. This kind of application is commonly termed a Cross-Language Information Retrieval (CLIR) system. Such CLIR systems typically involve a translation component of varying complexity, which is responsible for translating the user input into the document language. Using query translations from modern, phrase-based Statistical Machine Translation (SMT) systems, and subsequently retrieving monolingually is thus a straightforward choice. However, the amount of work committed to integrate such SMT models into CLIR, or even jointly model translation and retrieval, is rather small. In this thesis, I focus on the shared aspect of in translation-based CLIR: Both, translation and retrieval models, induce rankings over a set of candidate structures through assignment of scores. The subject of this thesis is to exploit this commonality in three different tasks: (1) Mate-ranking refers to the task of mining comparable data for SMT domain adaptation through translation-based CLIR. Cross-lingual mates are direct or close translations of the query. I will show that such a CLIR system is able to find in-domain comparable data from noisy user-generated corpora and improves in-domain translation performance of an SMT system. Conversely, the CLIR system relies itself on a translation model that is tailored for retrieval. This leads to the second direction of research, in which I develop two ways to optimize an SMT model for retrieval, namely (2) by SMT parameter optimization towards a retrieval objective (translation ranking), and (3) by presenting a joint model of translation and retrieval for document ranking. The latter abandons the common architecture of modeling both components separately. The former task refers to optimizing for preference of translation candidates that work well for retrieval. In the core task of document ranking for CLIR, I present a model that directly ranks documents using an SMT decoder. I present substantial improvements over state-of-the-art translation-based CLIR baseline systems, indicating that a joint model of translation and retrieval is a promising direction of research in the field of CLIR.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call