Abstract

As a potential bilingual resource, loanwords play a very important role in many natural language processing tasks. If loanwords in a low-resource language can be identified effectively, the generated donor-receipt word pairs will benefit many cross-lingual natural language processing tasks. However, most studies on loanword identification mainly focus on formal texts such as news and government documents. Loanword identification in social media texts is still an under-studied field. Since it faces many challenges and can be widely used in several downstream tasks, more efforts should be put on loanword identification in social media texts. In this study, we present a multi-task learning architecture with deep bi-directional recurrent neural networks for loanword identification in social media texts, where different task supervision can happen at different layers. The multi-task neural network architecture learns higher-order feature representations from word and character sequences along with basic spell error checking, part-of-speech tagging, and named entity recognition information. Experimental results on Uyghur loanword identification in social media texts in five donor languages (Chinese, Arabic, Russian, Turkish, and Farsi) show that our method achieves the best performance compared with several strong baseline systems. We also combine the loanword detection results into the training data of neural machine translation for low-resource language pairs. Experiments show that models trained on the extended datasets achieve significant improvements compared with the baseline models in all language pairs.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call