Abstract

When performing cross-language information retrieval (CLIR) for lower-resourced languages, a common approach is to retrieve over the output of machine translation (MT). However, there is no established guidance on how to optimize the resulting MT-IR system. In this paper, we examine the relationship between the performance of MT systems and both neural and term frequency-based IR models to identify how CLIR performance can best be predicted from MT quality. We explore performance across varying amounts of MT training data, varying numbers of byte pair encoding (BPE) merge operations, and two IR collections and retrieval models. We find that the choice of IR collection can substantially affect the predictive power of MT tuning decisions and evaluation, potentially introducing dissociations between MT-only and overall CLIR performance.
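
To make the prediction question concrete, the sketch below correlates corpus-level MT quality with retrieval effectiveness across a handful of MT configurations. It is a minimal illustration, not the paper's evaluation code: the configuration names, toy sentences, MAP values, and the choices of BLEU (via the sacrebleu package) and Pearson correlation (via scipy) are all assumptions made for the example.

```python
# A minimal sketch: relate MT quality (corpus BLEU) to end-to-end retrieval
# quality (here, a made-up MAP score per configuration). Assumes the
# sacrebleu and scipy packages; all data below are illustrative.
import sacrebleu
from scipy.stats import pearsonr

# One reference translation stream for a two-sentence "collection".
references = [[
    "the election results were announced on friday",
    "heavy rain caused flooding in the capital",
]]

# Hypothetical MT outputs per configuration, plus the IR effectiveness
# (e.g., mean average precision) measured over that translated collection.
runs = {
    "bpe_1k_merges":  (["election result announce friday",
                        "big rain flood the capital"], 0.18),
    "bpe_8k_merges":  (["the election results announced on friday",
                        "heavy rain caused flood in the capital"], 0.23),
    "bpe_32k_merges": (["the election results were announced friday",
                        "heavy rain caused flooding in capital"], 0.22),
}

bleus, maps = [], []
for name, (hyps, map_score) in runs.items():
    bleu = sacrebleu.corpus_bleu(hyps, references).score
    bleus.append(bleu)
    maps.append(map_score)
    print(f"{name}: BLEU={bleu:.1f} MAP={map_score:.2f}")

# How well does MT quality predict CLIR effectiveness over these runs?
r, _ = pearsonr(bleus, maps)
print(f"Pearson correlation between BLEU and MAP: r={r:.2f}")
```

In the paper's terms, a high correlation would mean that MT-only evaluation is a good proxy for end-to-end CLIR performance; the dissociations mentioned above correspond to cases where it is not.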

Highlights

  • For cross-language information retrieval scenarios involving queries in a higher-resourced language and documents in a lower-resourced language, direct training of a cross-language IR system (Litschko et al., 2018; Sasaki et al., 2018; Vulić and Moens, 2015) is typically infeasible due to insufficient data in the document language. A practical solution is to use machine translation to translate documents into the language of the queries, enabling the use of a traditional monolingual IR system trained on the higher-resourced language.

  • Using larger amounts of machine translation (MT) training data leads to higher MT and end-to-end performance; we focus on the subtleties of performance rather than on this generality.

  • We address our first question: when making MT tuning decisions such as selecting the number of byte pair encoding (BPE) merge operations, will a value that improves MT performance generally increase IR performance? We explore tuning BPE because it is commonly used in modern neural MT systems, and it is likely to have substantial end-to-end impact as it controls the subword input and output vocabulary (see the sketch following these highlights).
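
As a concrete illustration of the BPE tuning decision in the last highlight, the toy implementation below learns merge operations from a tiny word-frequency table and shows how the merge budget controls the resulting subword vocabulary. It is a from-scratch sketch for intuition only; actual MT systems use dedicated toolkits (e.g., subword-nmt or SentencePiece), and the corpus and merge counts here are made up.

```python
# Toy byte pair encoding (BPE) learner: illustrates how the number of merge
# operations controls the subword vocabulary. A sketch for intuition only.
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a word-frequency dict."""
    # Represent each word as a tuple of symbols with an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the best merge to every word in the vocabulary.
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges, vocab

corpus = {"lower": 5, "lowest": 3, "newer": 6, "wider": 2}
for budget in (5, 20):
    merges, vocab = learn_bpe(corpus, budget)
    subwords = {s for word in vocab for s in word}
    print(f"{budget} merges -> {len(subwords)} subword types: {sorted(subwords)}")
```

Fewer merge operations produce a smaller vocabulary of shorter subwords (more aggressive splitting), while more merges move toward whole-word units; the question raised above is whether the setting that maximizes MT quality also maximizes retrieval over the translated documents.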


Summary

Experiment Design

For cross-language information retrieval scenarios involving queries in a higher-resourced language and documents in a lower-resourced language, direct training of a cross-language IR system (Litschko et al., 2018; Sasaki et al., 2018; Vulić and Moens, 2015) is typically infeasible due to insufficient data in the document language. A practical solution is to use machine translation to translate documents into the language of the queries, enabling the use of a traditional monolingual IR system trained on the higher-resourced language.

To address the above questions, we trained an MT system in a large number of configurations, used these models to produce translated English collections of varying quality, and compared the performance of retrieval over the translated collections to retrieval over the matching source English documents. While an MT-IR system of this type is most appropriately used on lower-resourced languages, the resources needed to perform such a study using publicly available data thus far exist only for higher-resourced languages. To simulate a lower-resourced setting, we used a small portion of the MT training data available for higher-resourced languages, ablated it into smaller subsets, and did not rely on any other language resources. The software and data required to replicate the experiments reported here are available from https://github.com/ConstantineLignos/mt-clir-emnlp-2019.
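
The sketch below mirrors this comparison at toy scale: it runs the same queries through BM25 over a hypothetical machine-translated collection and over the matching source English documents, then checks how often the two runs agree on the top-ranked document. The rank_bm25 package, whitespace tokenization, toy documents, and top-1 agreement measure are illustrative assumptions; the paper's term frequency-based and neural retrieval models and its evaluation metrics are not reproduced here.

```python
# A sketch of the comparison: retrieve with BM25 over an MT-translated
# collection and over the matching source English documents, then compare
# rankings. Toy data; assumes the rank_bm25 package (pip install rank-bm25).
from rank_bm25 import BM25Okapi

source_docs = [
    "the election results were announced on friday",
    "heavy rain caused flooding in the capital city",
    "the national football team won the championship",
]
# Hypothetical (imperfect) MT output for the same three documents.
translated_docs = [
    "the result of election announced friday",
    "strong rain make flood in capital",
    "national team of football win the championship",
]
queries = ["election results", "flooding capital", "football championship"]

def rankings(docs, qs):
    """Rank document indices by BM25 score for each query."""
    bm25 = BM25Okapi([d.split() for d in docs])
    ranked = []
    for q in qs:
        scores = bm25.get_scores(q.split())
        ranked.append(sorted(range(len(docs)), key=lambda i: scores[i], reverse=True))
    return ranked

source_runs = rankings(source_docs, queries)
mt_runs = rankings(translated_docs, queries)

# Fraction of queries whose top-ranked document agrees between the two runs:
# a crude proxy for how much translation quality changes retrieval behaviour.
agreement = sum(s[0] == m[0] for s, m in zip(source_runs, mt_runs)) / len(queries)
print(f"Top-1 agreement between source and MT retrieval: {agreement:.2f}")
```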

Machine Translation
Information Retrieval
Impact of BPE
Evaluating MT for End-to-end Performance
Conclusions