Abstract
In this paper, we present information retrieval as a powerful tool for addressing a pressing problem in statistical machine translation: improving translation quality when too little parallel data is available. We devise a framework that uses information retrieval to create a synthetic corpus from readily available monolingual corpora. We propose an improved unsupervised training approach with a data selection mechanism that keeps only the most appropriate sentences, thereby reducing the amount of out-of-domain data in the additional bitext. We also introduce a new method for choosing sentences based on their relative similarity to, or difference from, the query sentence. Using the synthetic corpus created by our method, we are able to improve state-of-the-art statistical machine translation systems.
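The abstract does not specify the retrieval model or the selection rule, so the following is only a minimal sketch of what similarity-based sentence selection could look like. It assumes TF-IDF weighting with cosine similarity, and it reads "relative similarity" as keeping every retrieved sentence whose score is within a fixed fraction of the best score for that query. The function names (`tfidf_vectors`, `cosine`, `select_relative`) and the `ratio` parameter are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of tokenised documents."""
    n = len(docs)
    # Document frequency: in how many documents each token occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF, so that very common tokens still get a small positive weight.
    idf = {tok: math.log((1 + n) / (1 + cnt)) + 1.0 for tok, cnt in df.items()}
    vecs = [{tok: tf * idf[tok] for tok, tf in Counter(doc).items()} for doc in docs]
    return vecs, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_relative(query, corpus, ratio=0.8):
    """Return corpus sentences scoring at least `ratio` times the best score.

    ASSUMPTION: this relative threshold is an illustrative reading of
    "relative similarity"; the paper's actual criterion is not given here.
    """
    tokenised = [sentence.lower().split() for sentence in corpus]
    vecs, idf = tfidf_vectors(tokenised)
    # Query tokens unseen in the corpus get weight 0 and do not affect the score.
    q = {tok: tf * idf.get(tok, 0.0)
         for tok, tf in Counter(query.lower().split()).items()}
    scored = [(cosine(q, vec), sentence) for vec, sentence in zip(vecs, corpus)]
    best = max((score for score, _ in scored), default=0.0)
    if best == 0.0:
        return []  # nothing in the corpus overlaps with the query
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [sentence for score, sentence in ranked if score >= ratio * best]


if __name__ == "__main__":
    monolingual = [
        "the parliament adopted the budget",
        "the cat sat on the mat",
        "parliament voted on the budget proposal",
    ]
    for sentence in select_relative("parliament budget vote", monolingual):
        print(sentence)
```

A relative threshold, unlike a fixed top-k cut-off, lets the number of selected sentences vary per query: a query with many close matches contributes more sentences than one with a single good match. In the setting the abstract describes, the selected sentences would then be paired with their automatic translations to form the synthetic bitext.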