A dataset to evaluate Hindi Word Embeddings

Dinesh Gopalani, Vimal Kumar Soni, M C Govil

doi:10.1088/1757-899x/1131/1/012015

Abstract

The current trend in solving Natural Language Processing (NLP) challenges is to fetch data with online crawling methods and then apply shallow or deep learning methods to this data to develop models for the respective tasks. Word vectors generated with such methods are applied to several NLP challenges and are commonly evaluated on the word similarity task. For English, not only is a large amount of data available, but multiple datasets also exist to evaluate the performance of the developed models. The situation is different for Indian languages, and for Hindi in particular. To address this gap, we propose a dataset for checking word similarity in Hindi. The construction process and the subsequent annotation process are described in detail. To construct the dataset, 353 word pairs from the most popular English dataset are first selected and translated. The translations are verified by Hindi experts. The word pairs are then annotated independently by 11 native Hindi speakers, who were selected according to multiple criteria. The final dataset has been evaluated on CBOW and Skip-gram models.
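As a rough illustration of how a word similarity dataset of this kind is typically used, the sketch below scores each word pair with the cosine similarity of its two vectors and then compares the model scores with the human ratings using Spearman rank correlation. This is a minimal sketch under stated assumptions, not the authors' evaluation code: the abstract does not name the correlation measure, the function names (`cosine`, `average_ranks`, `spearman`, `evaluate`) are hypothetical, and the tokens, vectors, and ratings in the example are invented placeholders rather than entries from the dataset.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(pairs, vectors):
    """Return (Spearman rho, number of pairs covered by the vocabulary).

    pairs:   iterable of (word1, word2, human_score)
    vectors: dict mapping word -> list of floats
    Pairs with an out-of-vocabulary word are skipped.
    """
    human, model = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:
            human.append(score)
            model.append(cosine(vectors[w1], vectors[w2]))
    return spearman(human, model), len(human)


# Invented placeholder data, for illustration only.
toy_vectors = {
    "w_a": [1.0, 0.0],
    "w_b": [1.0, 0.1],
    "w_c": [0.0, 1.0],
    "w_d": [0.5, 0.5],
}
toy_pairs = [
    ("w_a", "w_b", 9.0),
    ("w_a", "w_d", 5.0),
    ("w_a", "w_c", 1.0),
    ("w_a", "w_missing", 3.0),  # skipped: second word has no vector
]
rho, covered = evaluate(toy_pairs, toy_vectors)
print(f"Spearman rho = {rho:.3f} over {covered} pairs")
```

In practice the `vectors` dictionary would be filled from a trained CBOW or Skip-gram model, and the number of covered pairs is worth reporting alongside the correlation, since pairs containing out-of-vocabulary words are silently skipped.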

Full Text