Abstract

Collecting parallel sentences from non-parallel data is a long-standing natural language processing research problem. In particular, parallel training sentences are crucial to the quality of machine translation systems. While many existing methods have shown encouraging results, they cannot learn the varying alignment weights of words in parallel sentences. To address this issue, we propose a novel parallel hierarchical attention neural network which encodes monolingual sentences against bilingual sentence pairs and constructs a classifier to extract parallel sentences. In particular, our attention mechanism can learn different alignment weights for the words in parallel sentences. Experimental results show that our model obtains state-of-the-art performance on the English-French, English-German, and English-Chinese datasets of the BUCC 2017 shared task on parallel sentence extraction.
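
The abstract does not spell out the architecture; the following PyTorch sketch only illustrates the general idea of word-level attention producing alignment weights between a candidate source and target sentence, followed by a parallel/non-parallel classifier. All names and design choices here (ParallelAttentionScorer, hidden_dim, dot-product attention, mean pooling) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ParallelAttentionScorer(nn.Module):
    """Toy sentence-pair classifier: two BiLSTM encoders plus word-level attention.

    Illustrative sketch of the general idea (alignment weights over words feeding
    a parallel/non-parallel classifier), not the network proposed in the paper.
    """
    def __init__(self, src_vocab, tgt_vocab, emb_dim=128, hidden_dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.src_enc = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.tgt_enc = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Sequential(
            nn.Linear(4 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, src_ids, tgt_ids):
        # Encode each sentence independently (monolingual encoders).
        h_src, _ = self.src_enc(self.src_emb(src_ids))    # (B, S, 2H)
        h_tgt, _ = self.tgt_enc(self.tgt_emb(tgt_ids))    # (B, T, 2H)

        # Word-level alignment weights: each source word attends over target words.
        scores = torch.bmm(h_src, h_tgt.transpose(1, 2))   # (B, S, T)
        align = torch.softmax(scores, dim=-1)               # alignment weights
        aligned_tgt = torch.bmm(align, h_tgt)                # (B, S, 2H)

        # Pool the matched representations and score parallel vs. non-parallel.
        pair_repr = torch.cat([h_src, aligned_tgt], dim=-1).mean(dim=1)   # (B, 4H)
        return torch.sigmoid(self.classifier(pair_repr)).squeeze(-1)      # P(parallel)
```

In such a setup, a candidate sentence pair would be kept as parallel when the predicted probability exceeds a chosen threshold.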

Highlights

  • Parallel sentences are a very important linguistic resource, consisting of text paired with its translation in other languages

  • Because the maximum entropy classifier (ME) and multilingual sentence embedding (MSE) methods rely on feature engineering, they require a great deal of manual annotation for alignments and bilingual words

  • Compared with a long short-term memory (LSTM) model that does not use the parallel attention mechanism, our proposed method shows a significant improvement

Introduction

Parallel sentences are a very important linguistic resource, consisting of text paired with its translation in other languages. Because collecting parallel sentences is important for improving the quality of machine translation systems, many works have tried to mine parallel sentences from comparable corpora over the last two decades. Traditional systems developed to extract parallel sentences from comparable corpora typically rely on multiple features or on metadata from the structure of the comparable corpora. Named entities are an important feature for scoring source-target candidate sentence pairs. For English, CoreNLP (https://stanfordnlp.github.io/CoreNLP/) can be used to extract persons, locations, and organizations, but there are no open-source tools for recognizing named entities in other languages such as Uyghur. To address these issues, many methods extract parallel sentences without feature engineering. More recent approaches use deep learning, such as convolutional neural networks [13] and recurrent neural networks based on long short-term memory (LSTM) [1, 14, 15], to learn an end-to-end classifier that filters parallel sentences.
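
To make the named-entity feature above concrete, the sketch below extracts English persons, locations, and organizations and scores a candidate pair by entity overlap. It uses the stanza Python pipeline rather than the Java CoreNLP toolkit cited in the text, and the pre-extracted foreign_entities argument is a hypothetical input: as the paragraph notes, for languages such as Uyghur there may be no open-source NER tool to produce it.

```python
import stanza

# English NER pipeline (run stanza.download('en') once before first use).
nlp_en = stanza.Pipeline(lang='en', processors='tokenize,ner', verbose=False)

def english_entities(sentence):
    """Return the set of person/location/organization mentions in an English sentence."""
    keep = {'PERSON', 'LOC', 'GPE', 'ORG'}
    doc = nlp_en(sentence)
    return {ent.text.lower() for ent in doc.ents if ent.type in keep}

def entity_overlap_feature(en_sentence, foreign_entities):
    """Jaccard overlap between English entities and entities of the candidate
    target sentence.  `foreign_entities` is assumed to be extracted (and mapped
    to English forms) beforehand; for many languages no open-source NER tool
    exists, so this feature cannot always be computed."""
    en_ents = english_entities(en_sentence)
    union = en_ents | foreign_entities
    if not union:
        return 0.0
    return len(en_ents & foreign_entities) / len(union)
```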
