Abstract

Word alignment is the task of detecting translation equivalents between the words of a sentence pair. Although word alignment is no longer strictly required for neural machine translation (NMT), it remains useful in a wealth of applications, e.g., bilingual lexicon induction and constrained decoding. However, the best-known word aligners are still GIZA++ and fastAlign, both of which implement the traditional IBM models. To keep pace with advances in NMT, there has been a surge of interest in replacing the IBM models with neural models. We follow this trend but aim to boost the performance of word alignment between Japanese and Chinese, which share a large portion of Chinese characters. Our key idea is to leverage these common Chinese characters in both languages as an indicator for inferring alignment; i.e., source and target words that share common Chinese characters are most likely aligned. Following this idea, we propose three methods that leverage common Chinese characters to boost mBERT-based word alignment: a reward factor, representation alignment, and contrastive training. Furthermore, we annotate and release a gold-standard dataset for Japanese-Chinese word alignment. Experiments on this dataset show that our methods outperform several strong baselines in terms of alignment error rate (AER) and verify the effectiveness of exploiting common Chinese characters.
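To make the reward-factor idea concrete, below is a minimal sketch of how shared Chinese characters could be turned into a bonus added to mBERT-based similarity scores before alignment extraction. The function names (`char_overlap_reward`, `rewarded_score`), the Jaccard overlap, and the weight `lam` are illustrative assumptions, not the paper's actual formulation; in practice one would also normalize Japanese kanji variants to their simplified Chinese forms.

```python
import re

# CJK Unified Ideographs block, covering most common Han characters.
HAN_RE = re.compile(r"[\u4e00-\u9fff]")

def han_chars(token: str) -> set:
    """Return the set of Han (Chinese) characters appearing in a token."""
    return set(HAN_RE.findall(token))

def char_overlap_reward(src_token: str, tgt_token: str) -> float:
    """Jaccard overlap of Han characters shared by a Japanese source token
    and a Chinese target token; 0.0 when either token has no Han characters.
    (Illustrative choice of overlap measure, not taken from the paper.)"""
    s, t = han_chars(src_token), han_chars(tgt_token)
    if not s or not t:
        return 0.0
    return len(s & t) / len(s | t)

def rewarded_score(sim: float, src_token: str, tgt_token: str,
                   lam: float = 0.5) -> float:
    """Add the character-overlap reward, weighted by the hypothetical
    hyperparameter lam, to an mBERT-based similarity score."""
    return sim + lam * char_overlap_reward(src_token, tgt_token)

# Example: Japanese 学生 and Chinese 学生 share both characters,
# so the pair receives the full reward on top of its mBERT similarity.
print(rewarded_score(0.7, "学生", "学生"))  # 0.7 + 0.5 * 1.0 = 1.2
```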
