Abstract

Abstract Bertalign is designed to improve sentence alignment accuracy for Chinese–English parallel corpora of literary texts. Aligning bilingual literary texts is not trivial, since most of the translation is interpretative and not based on 1-to-1 mappings between source and target sentences. Existing alignment methods highlight 1-to-1 links while having difficulty coping with 1-to-many and many-to-many alignments that are common in literary texts. To overcome the weaknesses of current approaches, we propose a novel two-step algorithm for bilingual sentence alignment. The first step finds the optimal paths for 1-to-1 alignments based on the top-k most semantically similar target sentences for each source sentence using the bidirectional encoder representations from transformer-based cross-lingual word embeddings. The second step relies on search paths found in the previous step to recover all valid alignments with more than one sentence on each side of the bilingual text. A comprehensive experiment was conducted on a newly built Chinese–English literary parallel corpus and a large-scale publicly available bilingual corpus of the Bible to compare the performance of Bertalign with five baseline systems: Gale-Church, Hunalign, Bleualign, Bleurtalign, and Vecalign. The results show that Bertalign achieves the highest accuracy in terms of F1 score on the two evaluation datasets than previous methods.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call