Abstract

Word alignment is the task of detecting translation equivalents between the words of a sentence pair. Although word alignment is no longer strictly required for neural machine translation (NMT), it remains useful in a wealth of applications, e.g., bilingual lexicon induction and constrained decoding. However, the best-known word aligners are still GIZA++ and fastAlign, both of which implement the traditional IBM models. To keep pace with advances in NMT, there has been a surge of interest in replacing the IBM models with neural models. We follow this trend but aim to boost the performance of word alignment between Japanese and Chinese, which share a large portion of Chinese characters. Our key idea is to leverage these common Chinese characters in both languages as an indicator for inferring alignment; i.e., source and target words that share common Chinese characters are most likely aligned. Following this idea, we propose three methods that leverage common Chinese characters to boost mBERT-based word alignment: a reward factor, representation alignment, and contrastive training. Furthermore, we annotate and release a gold-standard dataset for Japanese-Chinese word alignment. Experiments on this dataset show that our methods outperform several strong baselines in terms of alignment error rate (AER) and verify the effectiveness of exploiting common Chinese characters.
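To make the reward-factor idea concrete, below is a minimal sketch of how shared Chinese characters could be turned into a bonus added to mBERT-based similarity scores before alignment extraction. The function names (`char_overlap_reward`, `rewarded_score`), the Jaccard overlap, and the weight `lam` are illustrative assumptions, not the paper's actual formulation; in practice one would also normalize Japanese kanji variants to their simplified Chinese forms.

```python
import re

# CJK Unified Ideographs block, covering most common Han characters.
HAN_RE = re.compile(r"[\u4e00-\u9fff]")

def han_chars(token: str) -> set:
    """Return the set of Han (Chinese) characters appearing in a token."""
    return set(HAN_RE.findall(token))

def char_overlap_reward(src_token: str, tgt_token: str) -> float:
    """Jaccard overlap of Han characters shared by a Japanese source token
    and a Chinese target token; 0.0 when either token has no Han characters.
    (Illustrative choice of overlap measure, not taken from the paper.)"""
    s, t = han_chars(src_token), han_chars(tgt_token)
    if not s or not t:
        return 0.0
    return len(s & t) / len(s | t)

def rewarded_score(sim: float, src_token: str, tgt_token: str,
                   lam: float = 0.5) -> float:
    """Add the character-overlap reward, weighted by the hypothetical
    hyperparameter lam, to an mBERT-based similarity score."""
    return sim + lam * char_overlap_reward(src_token, tgt_token)

# Example: Japanese 学生 and Chinese 学生 share both characters,
# so the pair receives the full reward on top of its mBERT similarity.
print(rewarded_score(0.7, "学生", "学生"))  # 0.7 + 0.5 * 1.0 = 1.2
```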
