Bertalign: Improved word embedding-based sentence alignment for Chinese–English parallel corpora of literary texts

Lei Liu, Min Zhu

doi:10.1093/llc/fqac089

Abstract

Bertalign is designed to improve sentence alignment accuracy for Chinese–English parallel corpora of literary texts. Aligning bilingual literary texts is not trivial, since most of the translation is interpretative and not based on 1-to-1 mappings between source and target sentences. Existing alignment methods favor 1-to-1 links and have difficulty coping with the 1-to-many and many-to-many alignments that are common in literary texts. To overcome the weaknesses of current approaches, we propose a novel two-step algorithm for bilingual sentence alignment. The first step finds the optimal paths for 1-to-1 alignments based on the top-k most semantically similar target sentences for each source sentence, using cross-lingual embeddings based on Bidirectional Encoder Representations from Transformers (BERT). The second step relies on the search paths found in the first step to recover all valid alignments with more than one sentence on either side of the bilingual text. A comprehensive experiment was conducted on a newly built Chinese–English literary parallel corpus and a large-scale publicly available bilingual corpus of the Bible to compare the performance of Bertalign with five baseline systems: Gale-Church, Hunalign, Bleualign, Bleurtalign, and Vecalign. The results show that Bertalign achieves the highest accuracy, measured by F1 score, on both evaluation datasets, outperforming previous methods.
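
The first step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `top_k_candidates` and `one_to_one_path` are hypothetical, the toy vectors stand in for BERT-based cross-lingual sentence embeddings, and the use of dynamic programming over a monotonic path is an assumption, since the abstract does not say how the optimal paths are searched.

```python
import numpy as np


def top_k_candidates(src_vecs, tgt_vecs, k):
    """For each source sentence, return the indices and cosine scores
    of its k most similar target sentences."""
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sim = src @ tgt.T                       # cosine similarity matrix
    k = min(k, sim.shape[1])
    idx = np.argsort(-sim, axis=1)[:, :k]   # best candidates first
    scores = np.take_along_axis(sim, idx, axis=1)
    return idx, scores


def one_to_one_path(idx, scores, n_tgt):
    """Find a monotonic set of 1-to-1 links that maximizes the summed
    similarity, allowing links only between top-k candidate pairs."""
    n_src = idx.shape[0]
    allowed = {}
    for i in range(n_src):
        for j, s in zip(idx[i], scores[i]):
            allowed[(i, int(j))] = float(s)

    # best[i, j]: best score using the first i source and j target sentences.
    best = np.zeros((n_src + 1, n_tgt + 1))
    # back[i, j]: 0 = skip source sentence, 1 = skip target sentence, 2 = link.
    back = np.zeros((n_src + 1, n_tgt + 1), dtype=int)
    for i in range(1, n_src + 1):
        for j in range(1, n_tgt + 1):
            options = [best[i - 1, j], best[i, j - 1]]
            if (i - 1, j - 1) in allowed:
                options.append(best[i - 1, j - 1] + allowed[(i - 1, j - 1)])
            back[i, j] = int(np.argmax(options))
            best[i, j] = options[back[i, j]]

    # Trace the chosen moves back from the end of both texts.
    links = []
    i, j = n_src, n_tgt
    while i > 0 and j > 0:
        if back[i, j] == 2:
            links.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif back[i, j] == 0:
            i -= 1
        else:
            j -= 1
    return links[::-1]


# Toy embeddings (assumed values): 3 source sentences, 4 target sentences.
src_vecs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
tgt_vecs = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.1],
                     [0.1, 0.1, 0.1], [0.0, 0.1, 1.0]])

idx, scores = top_k_candidates(src_vecs, tgt_vecs, k=2)
links = one_to_one_path(idx, scores, n_tgt=tgt_vecs.shape[0])
print(links)  # [(0, 0), (1, 1), (2, 3)]
```

In this toy run, target sentence 2 is left unlinked. That is the kind of gap the second step is meant to resolve, by using the 1-to-1 path as a guide when recovering alignments that involve more than one sentence on a side.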

Journal: Digital Scholarship in the Humanities
Publication Date: Dec 29, 2022
Citations: 2
