Abstract
It has been shown that the use of pre-trained models (PTMs) can significantly improve the performance of natural language processing (NLP) tasks for language with rich resources, and also reduce the amount of labeled sample data required in supervised learning. However, there are still few research and shared task datasets available for Hindi, and PTMs for the Romanized Hindi script has been rarely released. In this work, we construct a Hindi pre-training corpus in Devanagari and Romanized scripts, and train Hindi pre-trained models with two versions: Hindi-Devanagari-Roberta and Hindi-Romanized-Roberta. We evaluate our model on 5 types of downstream NLP tasks with 8 datasets, and compare them with existing Hindi pre-training models and commonly used methods. Experimental results show that the model proposed in this work can achieve the best results on the all tasks, especially on Part-of-Speech Tagging and Named Entity Recognition tasks, which proves the validity and superiority of our Hindi pre-trained models. Specifically, the performance of Devanagari Hindi pre-trained model is better than the Romanized Hindi pre-trained model in the tasks of single-label Text Classification, Part-of-Speech Tagging, Named Entity Recognition, and Natural Language Inference. However, Romanized Hindi pre-trained model performs better in multi-label Text Classification and Machine Reading Comprehension, which may indicate that the pre-trained model of Romanized Hindi script has advantages in such tasks. We will publish our model to the community with the intention of promoting the future development of Hindi NLP.
Published Version
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have