Abstract

We propose a novel method for automatic sentence alignment from noisy parallel documents. We first formalize the sentence alignment problem as independent predictions of spans in the target document from sentences in the source document. We then introduce a total optimization method using integer linear programming to prevent span overlap and obtain non-monotonic alignments. We implement cross-language span prediction by fine-tuning pre-trained multilingual language models based on the BERT architecture and train them using pseudo-labeled data obtained from an unsupervised sentence alignment method. While the baseline methods use sentence embeddings and assume monotonic alignment, our method can capture token-to-token interactions between the source and target text and handle non-monotonic alignments. In sentence alignment experiments on English-Japanese, our method achieved an F1 score of 70.3, which is 8.0 points higher than the baseline method. In particular, our method improved the F1 score by 53.9 points for extracting non-parallel sentences. Our method also improved downstream machine translation accuracy by 4.1 BLEU points when the extracted bilingual sentences were used to fine-tune a pre-trained Japanese-to-English translation model.
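To make the total optimization step concrete, the following is a minimal sketch of selecting non-overlapping target spans with integer linear programming. The candidate spans, their scores, and the PuLP-based formulation are hypothetical illustrations, not the paper's implementation.

```python
# Hypothetical sketch: pick at most one target span per source sentence so that
# selected spans never overlap, maximizing the total span-prediction score.
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum, PULP_CBC_CMD

# candidates[s] = list of (start, end, score) target spans predicted for source sentence s,
# with (start, end) a half-open range over target sentence indices (illustrative values).
candidates = {
    0: [(0, 1, 0.9), (0, 2, 0.4)],
    1: [(1, 2, 0.7), (2, 3, 0.8)],
}
num_target_sentences = 4

prob = LpProblem("span_selection", LpMaximize)
x = {(s, i): LpVariable(f"x_{s}_{i}", cat=LpBinary)
     for s, spans in candidates.items() for i in range(len(spans))}

# Objective: total score of the selected spans.
prob += lpSum(candidates[s][i][2] * x[s, i] for (s, i) in x)

# Each source sentence selects at most one span (it may remain unaligned).
for s, spans in candidates.items():
    prob += lpSum(x[s, i] for i in range(len(spans))) <= 1

# Each target sentence is covered by at most one selected span (no overlap),
# which allows non-monotonic but consistent alignments.
for t in range(num_target_sentences):
    covering = [x[s, i] for (s, i) in x
                if candidates[s][i][0] <= t < candidates[s][i][1]]
    if covering:
        prob += lpSum(covering) <= 1

prob.solve(PULP_CBC_CMD(msg=0))
alignment = [(s, candidates[s][i][:2]) for (s, i) in x if x[s, i].varValue == 1]
print(alignment)  # e.g., [(0, (0, 1)), (1, (2, 3))]
```

In this toy instance, the solver keeps the two highest-scoring spans that do not share any target sentence; scaling the same formulation to all source sentences and candidate spans gives a globally consistent, possibly non-monotonic alignment.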

Highlights

  • Sentence alignment is the task of automatically extracting parallel sentences from noisy parallel documents

  • Parallel sentences are used to train cross-language models, especially machine translation (MT) systems. Both the quantity and the quality of the parallel sentences used for training are crucial for developing an accurate neural machine translation (NMT) system (Khayrallah and Koehn, 2018)

  • Our proposed method outperformed the previous work for all numbers of sentence pairs, and the variant using integer linear programming (ILP) achieved the highest F1 scores except for 1-to-2 alignment pairs


Summary

Introduction

Sentence alignment is the task of automatically extracting parallel sentences from noisy parallel documents. Automatic sentence alignment methods using neural networks have gained popularity (Gregoire and Langlais, 2018; Artetxe and Schwenk, 2019a; Yang et al., 2019; Thompson and Koehn, 2019). Such systems use a scoring function that calculates how parallel two sentences are from their sentence embeddings and obtain an alignment hypothesis from these scores with an alignment algorithm. Quan et al. (2013) described a legislation corpus as an example of non-monotonic alignments. In such cases, existing methods that assume monotonicity impair the accuracy.
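As a rough illustration of the embedding-based scoring used by these baseline systems, the sketch below computes a cosine-similarity score matrix between source and target sentence embeddings and reads off a simple alignment hypothesis. The `embed` function is a hypothetical stand-in for a multilingual sentence encoder and is not an API from the cited works.

```python
# Hypothetical sketch of embedding-based scoring in baseline sentence aligners.
import numpy as np

def embed(sentences: list[str]) -> np.ndarray:
    # Placeholder encoder: returns random unit vectors instead of real
    # multilingual sentence embeddings (assumption for illustration only).
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(sentences), 512))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

source = ["The act was enacted in 2005.", "It was amended in 2010."]
target = ["この法律は2005年に制定された。", "2010年に改正された。"]

src_emb = embed(source)   # shape: (num_source, dim)
tgt_emb = embed(target)   # shape: (num_target, dim)

# Cosine similarity matrix: scores[i, j] estimates how parallel source i and target j are.
scores = src_emb @ tgt_emb.T

# Simplest alignment hypothesis: each source sentence picks its best target;
# real baselines replace this step with a (typically monotonic) alignment algorithm.
alignment = scores.argmax(axis=1)
for i, j in enumerate(alignment):
    print(f"source {i} -> target {j} (score {scores[i, j]:.2f})")
```

This kind of sentence-level scoring cannot model token-to-token interactions between the two sides, which is the gap the span-prediction formulation above is meant to address.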
