Efficient document alignment across scenarios

Andoni Azpeitia,Thierry Etchegoyhen

doi:10.1007/s10590-019-09234-9

Abstract

We present and evaluate an approach to document alignment meant for efficiency and portability, as it relies on automatically extracted lexical translations and simple set-theoretic operations for the computation of document-level similarity. We compare our approach to the state of the art on a variety of alignment scenarios, showing that it outperforms alternative document-alignment methods in the vast majority of cases, on both parallel and comparable corpora. We also explore several forms of simple component optimisation to evaluate the potential for improvement of the core method, and describe several successful optimisation paths that lead to significant improvements over strong baselines. The proposed approach constitutes an effective and easy to deploy method to perform accurate document alignment across scenarios, with the potential to improve the creation of parallel corpora.

Full Text