Empirical Laws of Natural Language Processing for Hindi Language

Arun Babhulgaonkar, Adwait Tekale, Hrishikesh Khandare, Manali Musale, Atharv Kurdukar, Mahesh Shirsath

doi:10.1007/978-981-15-7234-0_18

Abstract

Empirical laws are statistical regularities that describe relations between quantities in a large dataset. They occur widely in nature and have been confirmed by observation [1]. The primary objective of this study is to verify three such laws, namely Zipf’s law, Mandelbrot’s approximation, and Heaps’ law, on a Hindi-language corpus. The procedure involves collecting a corpus, normalizing the text, tokenizing it into a list of words, identifying the word types and their frequencies, sorting and ranking the word types by frequency, and plotting frequency against rank to validate Zipf’s law and Mandelbrot’s approximation. For Heaps’ law, the relation between the number of word types and the number of tokens is examined over subsets of the corpus of different sizes. Our observations indicate that Hindi satisfies all three laws.
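The pipeline described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (tokenize, rank_frequency, loglog_fit, heaps_curve), the NFC normalization step, and the Devanagari-only token pattern are all hypothetical choices made here. In their usual forms, Zipf’s law states that frequency falls roughly as f(r) ≈ C / r with rank r, Mandelbrot’s approximation generalizes this to f(r) ≈ P (r + ρ)^(-B), and Heaps’ law states that the number of word types grows as V(N) ≈ K N^β with the number of tokens N. A straight-line fit on log-log axes is one simple way to estimate the Zipf and Heaps exponents.

```python
import math
import re
import unicodedata
from collections import Counter

# Assumption: a token is a run of Devanagari code points, excluding the
# danda (U+0964) and double danda (U+0965) punctuation marks. Real corpora
# may also need handling of ZWJ/ZWNJ, Latin words, and digits.
TOKEN_RE = re.compile(r"[\u0900-\u0963\u0966-\u097F]+")


def tokenize(text):
    """Normalize the text (NFC, an assumed choice) and split it into tokens."""
    return TOKEN_RE.findall(unicodedata.normalize("NFC", text))


def rank_frequency(tokens):
    """Return (rank, word_type, frequency) tuples, most frequent first."""
    counts = Counter(tokens)
    # Ties are broken by the word itself so the ranking is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ordered, start=1)]


def loglog_fit(xs, ys):
    """Least-squares line through (log x, log y); returns (slope, intercept)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def heaps_curve(tokens, step):
    """Return (tokens_seen, types_seen) pairs for growing corpus prefixes."""
    seen = set()
    points = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0 or i == len(tokens):
            points.append((i, len(seen)))
    return points


if __name__ == "__main__":
    # Toy sample only; a real experiment needs a large corpus.
    sample = "राम घर गया। राम ने खाना खाया। सीता घर आई।"
    tokens = tokenize(sample)
    ranked = rank_frequency(tokens)
    zipf_slope, _ = loglog_fit([r for r, _, _ in ranked], [f for _, _, f in ranked])
    points = heaps_curve(tokens, step=5)
    heaps_beta, _ = loglog_fit([n for n, _ in points], [v for _, v in points])
    print("Zipf slope (expected near -1 on large corpora):", zipf_slope)
    print("Heaps exponent estimate:", heaps_beta)
```

On a realistic corpus, a Zipf slope close to -1 and a Heaps exponent between 0 and 1 would be consistent with the laws. Fitting Mandelbrot’s additional parameter ρ requires a nonlinear fit, which this sketch leaves out.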

Full Text