Abstract

We present Shabd, a psycholinguistic database in Hindi. It is based on a corpus of 1.4 billion words from electronic newspapers and news websites. Word frequencies and part of speech information have been derived and are made available in a cleaned list of 34 thousand hand-selected words, and a list of 96 thousand words observed with a frequency of more than 100 times in the corpus. Next to the Shabd database, we also make a list with all 2.3 million word types available and a list with the 2.5 million most frequent word pairs (word bigrams). The quality of the word frequency measure was tested in two lexical decision tasks. We observed that the Shabd word frequencies outperform existing frequencies based on smaller corpora of newspapers but not the Worldlex word frequencies based on an analysis of blogs. We also observed that word frequency accounts for as much variance as contextual diversity (operationalized as the number of documents in which the words were observed). The Shabd database is freely available for research.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call