Abstract
This article presents the STACCw system for the BUCC 2017 shared task on parallel sentence extraction from comparable corpora. The original STACC approach, based on set-theoretic operations over bags of words, had previously been shown to be efficient and portable across domains and alignment scenarios. We describe an extension of this approach with a new weighting scheme and show that it provides significant improvements on the datasets provided for the shared task.
Highlights
Parallel corpora are an essential resource for the development of multilingual natural language processing applications, in particular statistical and neural machine translation (Brown et al, 1990; Bahdanau et al, 2014).
We followed the STACC approach in (Etchegoyhen et al, 2016; Etchegoyhen and Azpeitia, 2016), which is based on seed lexical translations, simple set expansion operations and the Jaccard similarity coefficient (Jaccard, 1901).
We describe STACCw, an extension of the approach with a word weighting scheme, and show that it provides significant improvements on the datasets provided for the BUCC 2017 shared task, while maintaining the portability of the original approach.
Summary
Parallel corpora are an essential resource for the development of multilingual natural language processing applications, in particular statistical and neural machine translation (Brown et al, 1990; Bahdanau et al, 2014). We followed the STACC approach in (Etchegoyhen et al, 2016; Etchegoyhen and Azpeitia, 2016), which is based on seed lexical translations, simple set expansion operations and the Jaccard similarity coefficient (Jaccard, 1901). This method has been shown to outperform state-of-the-art alternatives on a wide range of alignment tasks and provides a simple yet effective procedure that can be applied across domains and corpora with minimal adaptation and deployment costs. For scenarios where the alignment space is large, target sentences are first indexed using the Lucene search engine and then retrieved by building a query over the expanded translation sets created from each source sentence. This strategy drastically reduces the computational load, at the cost of missing some correct alignment pairs.
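To make the core idea concrete, the following is a minimal Python sketch of scoring candidate alignments with the Jaccard coefficient over translation sets. It is an illustration under simplifying assumptions, not the authors' implementation: the toy lexicon, the whitespace tokenization and the function names (`jaccard`, `expand`, `best_alignment`) are hypothetical, the sketch scores in one direction only, and it includes neither the STACCw weighting scheme nor the Lucene retrieval step, so every target sentence is compared exhaustively.

```python
from typing import Dict, Iterable, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient: size of the intersection over size of the union.

    Returns 0.0 when both sets are empty, to avoid a division by zero.
    """
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def expand(tokens: Iterable[str], lexicon: Dict[str, List[str]]) -> Set[str]:
    """Build a translation set from source tokens (hypothetical helper).

    Each token is replaced by its seed translations; a token without a
    lexicon entry is kept as-is (an assumption made for this sketch).
    """
    expanded: Set[str] = set()
    for token in tokens:
        token = token.lower()
        translations = lexicon.get(token)
        if translations:
            expanded.update(translations)
        else:
            expanded.add(token)
    return expanded


def best_alignment(
    source: str, targets: List[str], lexicon: Dict[str, List[str]]
) -> Tuple[int, float]:
    """Return the index and score of the highest-scoring target sentence."""
    source_set = expand(source.split(), lexicon)
    scores = [jaccard(source_set, set(t.lower().split())) for t in targets]
    best = max(range(len(targets)), key=scores.__getitem__)
    return best, scores[best]


# Toy English-French seed lexicon, invented for illustration only.
toy_lexicon = {"the": ["le", "la"], "cat": ["chat"], "sleeps": ["dort"]}
toy_targets = ["le chien mange", "le chat dort", "la maison est grande"]

index, score = best_alignment("The cat sleeps", toy_targets, toy_lexicon)
print(index, score)
```

In this toy run the second target sentence is selected, since it shares three of the four elements in the union of the two sets. In a large alignment space, the exhaustive loop over `targets` is the part that an index-and-query step such as the Lucene retrieval described above would replace, trading some recall for speed.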