Weighted Set-Theoretic Alignment of Comparable Sentences

Andoni Azpeitia, Eva Martínez Garcia, Thierry Etchegoyhen

doi:10.18653/v1/w17-2508

Abstract

This article presents the STACCw system for the BUCC 2017 shared task on parallel sentence extraction from comparable corpora. The original STACC approach, based on set-theoretic operations over bags of words, had previously been shown to be efficient and portable across domains and alignment scenarios. We describe an extension of this approach with a new weighting scheme and show that it provides significant improvements on the datasets provided for the shared task.

Highlights

Parallel corpora are an essential resource for the development of multilingual natural language processing applications, in particular statistical and neural machine translation (Brown et al., 1990; Bahdanau et al., 2014).
We followed the STACC approach in (Etchegoyhen et al., 2016; Etchegoyhen and Azpeitia, 2016), which is based on seed lexical translations, simple set expansion operations and the Jaccard similarity coefficient (Jaccard, 1901).
We describe STACCw, an extension of the approach with a word weighting scheme, and show that it provides significant improvements on the datasets provided for the BUCC 2017 shared task, while maintaining the portability of the original approach.

Summary

Introduction

Parallel corpora are an essential resource for the development of multilingual natural language processing applications, in particular statistical and neural machine translation (Brown et al., 1990; Bahdanau et al., 2014). We followed the STACC approach described in (Etchegoyhen et al., 2016; Etchegoyhen and Azpeitia, 2016), which is based on seed lexical translations, simple set expansion operations and the Jaccard similarity coefficient (Jaccard, 1901). This method has been shown to outperform state-of-the-art alternatives on a large range of alignment tasks and provides a simple yet effective procedure that can be applied across domains and corpora with minimal adaptation and deployment costs. For scenarios where the alignment space is large, target sentences are first indexed using the Lucene search engine and then retrieved by building a query over the expanded translation sets created from each source sentence. This strategy drastically reduces the computational load, at the cost of missing some correct alignment pairs.
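To make the set-theoretic idea concrete, the following is a minimal sketch of Jaccard-based scoring over translation sets. It is not the STACC implementation: the function names, the toy lexicons, the rule of copying tokens that have no lexicon entry, and the averaging of the two translation directions are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def expand(tokens, lexicon):
    """Map tokens to their seed translations (hypothetical lexicon format:
    token -> list of translations). Tokens without an entry are copied
    unchanged; this expansion rule is an assumption of the sketch."""
    expanded = set()
    for tok in tokens:
        translations = lexicon.get(tok)
        if translations:
            expanded.update(translations)
        else:
            expanded.add(tok)  # e.g. numbers or names shared across languages
    return expanded


def similarity(src_tokens, tgt_tokens, src2tgt, tgt2src):
    """Average of the Jaccard scores computed in both translation directions
    (the symmetric averaging is an assumption of the sketch)."""
    forward = jaccard(expand(src_tokens, src2tgt), tgt_tokens)
    backward = jaccard(expand(tgt_tokens, tgt2src), src_tokens)
    return (forward + backward) / 2


# Toy data, invented for illustration only.
src2tgt = {"the": ["le", "la"], "cat": ["chat"]}
tgt2src = {"le": ["the"], "chat": ["cat"]}
score = similarity(["the", "cat", "2017"], ["le", "chat", "2017"],
                   src2tgt, tgt2src)
print(score)
```

In this toy case the forward direction yields 3/4, since the spurious translation "la" enlarges the union, and the backward direction yields 1, so the averaged score is 0.875. In a setting such as the one described above, a score of this kind would be computed only for the candidate target sentences returned by the index query rather than for every sentence pair.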

Weighted STACC

BUCC 2017 Shared Task

Experimental Settings

Results

Conclusion