Abstract

When performing cross-language information retrieval (CLIR) for lower-resourced languages, a common approach is to retrieve over the output of machine translation (MT). However, there is no established guidance on how to optimize the resulting MT-IR system. In this paper, we examine the relationship between the performance of MT systems and both neural and term frequency-based IR models to identify how CLIR performance can best be predicted from MT quality. We explore performance across varying amounts of MT training data, varying numbers of byte pair encoding (BPE) merge operations, and two IR collections and retrieval models. We find that the choice of IR collection can substantially affect the predictive power of MT tuning decisions and evaluation, potentially introducing dissociations between MT-only and overall CLIR performance.
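
To make the prediction question concrete, the sketch below correlates corpus-level MT quality with retrieval effectiveness across a handful of MT configurations. It is a minimal illustration, not the paper's evaluation code: the configuration names, toy sentences, MAP values, and the choices of BLEU (via the sacrebleu package) and Pearson correlation (via scipy) are all assumptions made for the example.

```python
# A minimal sketch: relate MT quality (corpus BLEU) to end-to-end retrieval
# quality (here, a made-up MAP score per configuration). Assumes the
# sacrebleu and scipy packages; all data below are illustrative.
import sacrebleu
from scipy.stats import pearsonr

# One reference translation stream for a two-sentence "collection".
references = [[
    "the election results were announced on friday",
    "heavy rain caused flooding in the capital",
]]

# Hypothetical MT outputs per configuration, plus the IR effectiveness
# (e.g., mean average precision) measured over that translated collection.
runs = {
    "bpe_1k_merges":  (["election result announce friday",
                        "big rain flood the capital"], 0.18),
    "bpe_8k_merges":  (["the election results announced on friday",
                        "heavy rain caused flood in the capital"], 0.23),
    "bpe_32k_merges": (["the election results were announced friday",
                        "heavy rain caused flooding in capital"], 0.22),
}

bleus, maps = [], []
for name, (hyps, map_score) in runs.items():
    bleu = sacrebleu.corpus_bleu(hyps, references).score
    bleus.append(bleu)
    maps.append(map_score)
    print(f"{name}: BLEU={bleu:.1f} MAP={map_score:.2f}")

# How well does MT quality predict CLIR effectiveness over these runs?
r, _ = pearsonr(bleus, maps)
print(f"Pearson correlation between BLEU and MAP: r={r:.2f}")
```

In the paper's terms, a high correlation would mean that MT-only evaluation is a good proxy for end-to-end CLIR performance; the dissociations mentioned above correspond to cases where it is not.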

Highlights

  • For cross-language information retrieval scenarios involving queries in a higher-resourced language and documents in a lower-resourced language, direct training of a cross-language IR system (Litschko et al., 2018; Sasaki et al., 2018; Vulić and Moens, 2015) is typically infeasible due to insufficient data in the document language. A practical solution is to use machine translation to translate documents into the language of the queries, enabling the use of a traditional monolingual IR system trained on the higher-resourced language.

  • Using larger amounts of machine translation (MT) training data leads to higher MT and end-to-end performance; we focus on the subtleties of performance rather than on this generality.

  • We address our first question: when making MT tuning decisions such as selecting the number of byte pair encoding (BPE) merge operations, will a value that improves MT performance generally increase IR performance? We explore tuning BPE because it is commonly used in modern neural MT systems, and it is likely to have substantial end-to-end impact as it controls the subword input and output vocabulary (see the sketch following these highlights).
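
As a concrete illustration of the BPE tuning decision in the last highlight, the toy implementation below learns merge operations from a tiny word-frequency table and shows how the merge budget controls the resulting subword vocabulary. It is a from-scratch sketch for intuition only; actual MT systems use dedicated toolkits (e.g., subword-nmt or SentencePiece), and the corpus and merge counts here are made up.

```python
# Toy byte pair encoding (BPE) learner: illustrates how the number of merge
# operations controls the subword vocabulary. A sketch for intuition only.
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a word-frequency dict."""
    # Represent each word as a tuple of symbols with an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the best merge to every word in the vocabulary.
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges, vocab

corpus = {"lower": 5, "lowest": 3, "newer": 6, "wider": 2}
for budget in (5, 20):
    merges, vocab = learn_bpe(corpus, budget)
    subwords = {s for word in vocab for s in word}
    print(f"{budget} merges -> {len(subwords)} subword types: {sorted(subwords)}")
```

Fewer merge operations produce a smaller vocabulary of shorter subwords (more aggressive splitting), while more merges move toward whole-word units; the question raised above is whether the setting that maximizes MT quality also maximizes retrieval over the translated documents.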


Summary

Experiment Design

For cross-language information retrieval scenarios involving queries in a higher-resourced language and documents in a lower-resourced language, direct training of a cross-language IR system (Litschko et al., 2018; Sasaki et al., 2018; Vulić and Moens, 2015) is typically infeasible due to insufficient data in the document language. A practical solution is to use machine translation to translate documents into the language of the queries, enabling the use of a traditional monolingual IR system trained on the higher-resourced language.

To address the above questions, we trained an MT system in a large number of configurations, used these models to produce translated English collections of varying quality, and compared the performance of retrieval over the translated collections to retrieval over the matching source English documents. While an MT-IR system of this type is most appropriately used on lower-resourced languages, the resources needed to perform such a study using publicly available data thus far exist only for higher-resourced languages. To simulate a lower-resourced setting, we used a small portion of the MT training data available for higher-resourced languages, ablated it into smaller subsets, and did not rely on any other language resources. The software and data required to replicate the experiments reported here are available from https://github.com/ConstantineLignos/mt-clir-emnlp-2019.
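
The sketch below mirrors this comparison at toy scale: it runs the same queries through BM25 over a hypothetical machine-translated collection and over the matching source English documents, then checks how often the two runs agree on the top-ranked document. The rank_bm25 package, whitespace tokenization, toy documents, and top-1 agreement measure are illustrative assumptions; the paper's term frequency-based and neural retrieval models and its evaluation metrics are not reproduced here.

```python
# A sketch of the comparison: retrieve with BM25 over an MT-translated
# collection and over the matching source English documents, then compare
# rankings. Toy data; assumes the rank_bm25 package (pip install rank-bm25).
from rank_bm25 import BM25Okapi

source_docs = [
    "the election results were announced on friday",
    "heavy rain caused flooding in the capital city",
    "the national football team won the championship",
]
# Hypothetical (imperfect) MT output for the same three documents.
translated_docs = [
    "the result of election announced friday",
    "strong rain make flood in capital",
    "national team of football win the championship",
]
queries = ["election results", "flooding capital", "football championship"]

def rankings(docs, qs):
    """Rank document indices by BM25 score for each query."""
    bm25 = BM25Okapi([d.split() for d in docs])
    ranked = []
    for q in qs:
        scores = bm25.get_scores(q.split())
        ranked.append(sorted(range(len(docs)), key=lambda i: scores[i], reverse=True))
    return ranked

source_runs = rankings(source_docs, queries)
mt_runs = rankings(translated_docs, queries)

# Fraction of queries whose top-ranked document agrees between the two runs:
# a crude proxy for how much translation quality changes retrieval behaviour.
agreement = sum(s[0] == m[0] for s, m in zip(source_runs, mt_runs)) / len(queries)
print(f"Top-1 agreement between source and MT retrieval: {agreement:.2f}")
```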

Machine Translation
Information Retrieval
Impact of BPE
Evaluating MT for End-to-end Performance
Conclusions