Abstract

Parallel corpus mining (PCM) benefits many corpus-based natural language processing tasks, e.g., machine translation and bilingual dictionary induction, especially for low-resource languages and domains. It relies heavily on cross-lingual representations to model the interdependencies between languages and to determine whether two sentences are parallel. In this paper, we take a first step towards exploiting a multilingual Transformer translation model to produce expressive sentence representations for PCM. Since the standard Transformer yields no immediate sentence representation, we pool the encoder's output representations into a sentence representation, which is further optimized as part of the translation model's training. Experiments on the BUCC PCM task show that the proposed method, with the assistance of pre-trained multilingual BERT, improves mining performance over existing methods. To further test the practical utility of the proposed method, we mine parallel sentences from public resources and find that the mined sentences indeed enhance low-resource machine translation.
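The abstract does not spell out the pooling operator or how candidate pairs are scored; the sketch below is a minimal illustration only, assuming mean pooling over the encoder's non-padding output states and cosine similarity for scoring. All function and variable names here are hypothetical, not taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def pool_encoder_states(encoder_states: torch.Tensor,
                        src_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool Transformer encoder outputs into a fixed-size sentence vector.

    encoder_states: (batch, src_len, d_model) final encoder hidden states.
    src_mask:       (batch, src_len), 1 at real tokens, 0 at padding.
    """
    mask = src_mask.unsqueeze(-1).to(encoder_states.dtype)  # (batch, src_len, 1)
    summed = (encoder_states * mask).sum(dim=1)             # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1.0)                 # token count per sentence
    return summed / counts                                  # (batch, d_model)

def parallel_score(src_vec: torch.Tensor, tgt_vec: torch.Tensor) -> torch.Tensor:
    # One common way to decide whether two sentences are parallel is to
    # threshold the cosine similarity of their pooled sentence vectors.
    return F.cosine_similarity(src_vec, tgt_vec, dim=-1)
```

Because pooling yields a fixed-size vector regardless of sentence length, sentences from both languages land in a shared embedding space, and mining reduces to a nearest-neighbor search over these vectors.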
