Employing word mover's distance for cross‐lingual plagiarized text detection

Chia‐Ming Chang, San‐Yih Hwang, Chia‐Hsuan Chang

doi:10.1002/pra2.229

Abstract

While globalization has brought many successes, it has also made cross‐lingual plagiarism more common. Detecting plagiarism across languages requires the ability to compare the semantic similarity of texts written in different languages. Previous work relies on massive bilingual resources, such as comparable corpora, parallel corpora, and even commercial machine translation, as references. For domain‐specific applications, however, collecting such resources is labor‐intensive and impractical. In addition, the lack of interpretability in existing methods makes it difficult for humans to inspect retrieval results. Hence, a resource‐light and interpretable method for cross‐lingual plagiarism detection is needed. In this study, we propose a new detection method, called CL‐WMD, which is built upon word embedding techniques. CL‐WMD requires only a small set of translation pairs to constitute a bilingual reference and calculates semantic distances between texts using word mover's distance, which provides interpretable word alignment information between the two compared text spans. Our experiments are conducted on a bilingual scientific publication corpus composed of two typologically diverse languages: English and Chinese. The results demonstrate that CL‐WMD achieves higher accuracy than most existing methods and performs better than or comparably to the translation‐based method in paragraph‐level and sentence‐level plagiarism detection tasks.
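To make the idea of word mover's distance and its word alignment concrete, the following is a minimal sketch, not the authors' implementation. It assumes that words of both languages already live in one shared embedding space, and it handles only a special case: two texts with the same number of distinct words, each carrying equal weight 1/n. In that case the optimal transport plan is a one-to-one matching, so a brute-force search over permutations is exact for tiny inputs. The 2-d vectors, the example words, and the function names are all made up for illustration; the general case with unequal word weights needs a linear-programming or optimal-transport solver.

```python
import math
from itertools import permutations

# Hypothetical 2-d vectors standing in for a shared bilingual embedding space.
EMB = {
    "plagiarism": (1.0, 0.0),
    "detection": (0.0, 1.0),
    "抄襲": (0.9, 0.1),
    "偵測": (0.1, 0.9),
}


def euclid(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def wmd_equal_uniform(src, tgt, emb):
    """Word mover's distance for the equal-size, uniform-weight special case.

    src and tgt are lists of distinct words with len(src) == len(tgt) == n.
    Each word has mass 1/n, so the cheapest transport plan is a one-to-one
    matching; we find it by trying every permutation (fine only for small n).
    Returns (distance, alignment), where alignment pairs each source word
    with the target word that receives its mass.
    """
    n = len(src)
    if n == 0 or n != len(tgt):
        raise ValueError("this sketch needs two non-empty texts of equal size")
    best_cost, best_perm = math.inf, None
    for perm in permutations(range(n)):
        cost = sum(euclid(emb[src[i]], emb[tgt[j]]) for i, j in enumerate(perm)) / n
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    alignment = [(src[i], tgt[j]) for i, j in enumerate(best_perm)]
    return best_cost, alignment


dist, align = wmd_equal_uniform(["plagiarism", "detection"], ["偵測", "抄襲"], EMB)
print(align)  # [('plagiarism', '抄襲'), ('detection', '偵測')]
```

The returned alignment is what makes the distance inspectable: a reviewer can see which word in one text was matched to which word in the other, rather than receiving only a score.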

Full Text