Abstract

Loanwords are a potential bilingual resource and play an important role in many natural language processing tasks. If loanwords in a low-resource language can be identified effectively, the resulting donor-recipient word pairs can benefit many cross-lingual natural language processing tasks. However, most studies on loanword identification focus on formal texts such as news and government documents, and loanword identification in social media texts remains under-studied. Because the task poses many challenges and its output is useful for several downstream tasks, it deserves more attention. In this study, we present a multi-task learning architecture with deep bi-directional recurrent neural networks for loanword identification in social media texts, in which supervision for different tasks can be applied at different layers. The multi-task neural network architecture learns higher-order feature representations from word and character sequences, together with information from basic spelling error checking, part-of-speech tagging, and named entity recognition. Experimental results on identifying Uyghur loanwords from five donor languages (Chinese, Arabic, Russian, Turkish, and Farsi) in social media texts show that our method achieves the best performance compared with several strong baseline systems. We also add the loanword detection results to the training data of neural machine translation for low-resource language pairs. Experiments show that models trained on the extended datasets achieve significant improvements over the baseline models for all language pairs.
