Abstract

Parallel corpus mining (PCM) benefits many corpus-based natural language processing tasks, e.g., machine translation and bilingual dictionary induction, especially for low-resource languages and domains. It relies heavily on cross-lingual representations to model the interdependencies between languages and to determine whether two sentences are parallel. In this paper, we take a first step towards exploiting a multilingual Transformer translation model to produce expressive sentence representations for PCM. Since the standard Transformer yields no immediate sentence representation, we pool the encoder's output representations into a sentence representation, which is further optimized as part of the translation model's training. Experiments on the BUCC PCM task show that the proposed method, with the assistance of pre-trained multilingual BERT, improves mining performance over existing methods. To further test the practical utility of the proposed method, we mine parallel sentences from public resources and find that the mined sentences indeed enhance low-resource machine translation.
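The abstract does not spell out the pooling operator or how candidate pairs are scored; the sketch below is a minimal illustration only, assuming mean pooling over the encoder's non-padding output states and cosine similarity for scoring. All function and variable names here are hypothetical, not taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def pool_encoder_states(encoder_states: torch.Tensor,
                        src_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool Transformer encoder outputs into a fixed-size sentence vector.

    encoder_states: (batch, src_len, d_model) final encoder hidden states.
    src_mask:       (batch, src_len), 1 at real tokens, 0 at padding.
    """
    mask = src_mask.unsqueeze(-1).to(encoder_states.dtype)  # (batch, src_len, 1)
    summed = (encoder_states * mask).sum(dim=1)             # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1.0)                 # token count per sentence
    return summed / counts                                  # (batch, d_model)

def parallel_score(src_vec: torch.Tensor, tgt_vec: torch.Tensor) -> torch.Tensor:
    # One common way to decide whether two sentences are parallel is to
    # threshold the cosine similarity of their pooled sentence vectors.
    return F.cosine_similarity(src_vec, tgt_vec, dim=-1)
```

Because pooling yields a fixed-size vector regardless of sentence length, sentences from both languages land in a shared embedding space, and mining reduces to a nearest-neighbor search over these vectors.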
