Abstract

Parallel corpus filtering (PCF) aims to remove low-quality data from parallel corpora. Recently, deep learning-based methods have been employed to assess the quality of sentence pairs in a parallel corpus, alongside rule-based filtering that discards noisy data according to pre-defined error types. Despite their widespread use, to the best of our knowledge, a comprehensive investigation into the practical applicability and interpretability of PCF techniques is still lacking. In this study, we raise two questions about deep learning-based PCF: (i) can it extract high-quality data? and (ii) are its scoring functions reliable? To answer these questions, we conduct comparative experiments on various PCF techniques with four datasets spanning two language pairs, English–Korean and English–Japanese. The experiments demonstrate that the performance of deep learning-based PCF depends heavily on the target parallel corpus and fluctuates with its characteristics. In particular, we find that sentence pairs scored highly by a PCF technique do not necessarily guarantee high quality; instead, the scoring exhibits unintended preferences.
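The score-and-threshold pipeline that such deep learning-based PCF follows can be illustrated with a minimal sketch. The choice of LaBSE embeddings, the filter_pairs helper, and the 0.7 cutoff below are illustrative assumptions for exposition only, not the scoring functions evaluated in this study.

```python
# Minimal sketch of score-based parallel corpus filtering (PCF).
# Assumptions (not from the paper): LaBSE cross-lingual embeddings
# as the quality scorer and a fixed similarity threshold.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

def filter_pairs(pairs, threshold=0.7):
    """Keep (src, tgt) pairs whose cross-lingual cosine similarity
    meets the threshold; lower-scored pairs are treated as noise."""
    src_emb = model.encode([s for s, _ in pairs], normalize_embeddings=True)
    tgt_emb = model.encode([t for _, t in pairs], normalize_embeddings=True)
    # Embeddings are L2-normalized, so the dot product is cosine similarity.
    scores = np.sum(src_emb * tgt_emb, axis=1)
    return [(p, s) for p, s in zip(pairs, scores) if s >= threshold]

pairs = [
    ("The weather is nice today.", "오늘 날씨가 좋다."),       # aligned pair
    ("Click here to subscribe!", "全く関係のない文です。"),  # misaligned pair
]
for (src, tgt), score in filter_pairs(pairs):
    print(f"{score:.3f}\t{src} ||| {tgt}")
```

As the paper's findings suggest, a high score from such a model does not by itself guarantee a high-quality pair; the threshold and scorer must be validated against the characteristics of the target corpus.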
