Abstract

This article presents the STACCw system for the BUCC 2017 shared task on parallel sentence extraction from comparable corpora. The original STACC approach, based on set-theoretic operations over bags of words, had previously been shown to be efficient and portable across domains and alignment scenarios. We describe an extension of this approach with a new weighting scheme and show that it provides significant improvements on the datasets provided for the shared task.

Highlights

  • Parallel corpora are an essential resource for the development of multilingual natural language processing applications, in particular statistical and neural machine translation (Brown et al, 1990; Bahdanau et al, 2014)

  • We followed the STACC approach in (Etchegoyhen et al, 2016; Etchegoyhen and Azpeitia, 2016), which is based on seed lexical translations, simple set expansion operations and the Jaccard similarity coefficient (Jaccard, 1901)

  • We describe STACCw, an extension of the approach with a word weighting scheme, and show that it provides significant improvements on the datasets provided for the BUCC 2017 shared task, while maintaining the portability of the original approach

Summary

Introduction

Parallel corpora are an essential resource for the development of multilingual natural language processing applications, in particular statistical and neural machine translation (Brown et al, 1990; Bahdanau et al, 2014). We follow the STACC approach (Etchegoyhen et al, 2016; Etchegoyhen and Azpeitia, 2016), which is based on seed lexical translations, simple set expansion operations and the Jaccard similarity coefficient (Jaccard, 1901). This method has been shown to outperform state-of-the-art alternatives on a wide range of alignment tasks and provides a simple yet effective procedure that can be applied across domains and corpora with minimal adaptation and deployment costs. For scenarios where the alignment space is large, target sentences are first indexed using the Lucene search engine and then retrieved by building a query over the expanded translation sets created from each source sentence. This strategy drastically reduces the computational load, at the cost of missing some correct alignment pairs.
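The core idea of comparing lexical translation sets with the Jaccard coefficient can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy French-English lexicon and the whitespace-style token lists are assumptions made for illustration, and the expansion step is reduced to a plain lexicon lookup, leaving out any further set expansion operations as well as the Lucene-based candidate retrieval.

```python
def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| of two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def expand(tokens, lexicon):
    """Map source tokens to the set of their seed lexical translations.

    Simplification (assumption): only a direct lexicon lookup is done;
    tokens missing from the lexicon contribute nothing.
    """
    expanded = set()
    for tok in tokens:
        expanded.update(lexicon.get(tok, ()))
    return expanded


def similarity(src_tokens, tgt_tokens, lexicon):
    """Score a sentence pair by comparing translated source tokens
    with the target tokens."""
    return jaccard(expand(src_tokens, lexicon), tgt_tokens)


def best_match(src_tokens, candidates, lexicon):
    """Return the candidate target sentence with the highest score."""
    return max(candidates, key=lambda c: similarity(src_tokens, c, lexicon))


# Hypothetical seed lexicon, for illustration only.
TOY_LEXICON = {
    "le": {"the"},
    "chat": {"cat"},
    "dort": {"sleeps", "sleep"},
}

source = ["le", "chat", "dort"]
targets = [["a", "dog", "barks"], ["the", "cat", "sleeps"]]
score = similarity(source, targets[1], TOY_LEXICON)
match = best_match(source, targets, TOY_LEXICON)
```

In this toy case the translated source set is {the, cat, sleeps, sleep}, which shares three of four distinct tokens with the second target sentence, so that pair scores 0.75 and is selected as the best match.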

Weighted STACC
BUCC 2017 Shared Task
Experimental Settings
Results
Conclusion