Loanword identification based on web resources: A case study on Wikipedia

Chenggang Mi

doi:10.1016/j.csl.2023.101517

Abstract

To alleviate resource scarcity and improve robustness in loanword identification, this study proposes a novel loanword identification method based on Wikipedia. We first describe how to obtain loanword candidate datasets and comparable corpora from Wikipedia. Based on these corpora, we develop a pseudo-data generation model for loanword identification tasks. We then propose a loanword identification model, the PK-SM-Bi-LSTM-CRF framework, which builds on a bidirectional LSTM-CRF architecture and is further enhanced by prior knowledge and self-matching attention. The proposed method has two main advantages. First, in addition to the commonly used word embedding and character embedding features, it incorporates several other features: subword embedding, lexical similarity, word alignment, and semantic similarity. Second, geographic distance serves as a primary criterion for selecting the best-matched donor word from several candidates. To evaluate the effectiveness of the proposed method, we conduct a series of experiments on different languages. Experimental results show that the proposed method outperforms all baseline systems.

Full Text