Abstract

Empirical laws are statistical regularities that describe relations between quantities in large datasets. They occur widely in nature and are supported by observation [1]. The primary objective of this study is to verify three such laws, namely Zipf’s law, Mandelbrot’s approximation, and Heaps’ law, for a Hindi language corpus. This involves collecting a corpus, normalizing the text, tokenizing it into a list of words, identifying the word types and their frequencies, sorting and ranking the types by frequency, and plotting frequency against rank to validate Zipf’s law and Mandelbrot’s approximation. For Heaps’ law, we consider the relation between the number of word types and the number of tokens across different subsets of the corpus. Based on our observations, the Hindi language satisfies all three laws.
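The steps listed in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the text was normalized or how the laws were fitted. The function names (`tokenize`, `rank_frequency`, `heaps_curve`, `fit_loglog`), the Devanagari character range used for tokenization, and the log-log least-squares fit are all choices made here for illustration. Mandelbrot's approximation, which adds a shift to the rank, is not fitted in this sketch.

```python
import math
import re
import unicodedata
from collections import Counter

# Assumption: a "word" is a run of Devanagari letters and vowel signs.
# The range skips the danda marks (U+0964, U+0965), the Devanagari digits,
# and the sign characters U+0970 and U+0971.
DEVANAGARI_WORD = re.compile(r"[\u0900-\u0963\u0972-\u097F]+")


def tokenize(text):
    """Normalize to NFC, then return the list of Devanagari word tokens."""
    text = unicodedata.normalize("NFC", text)
    return DEVANAGARI_WORD.findall(text)


def rank_frequency(tokens):
    """Return (word_type, frequency) pairs sorted by descending frequency.

    The rank of a type is its 1-based position in the returned list.
    """
    return Counter(tokens).most_common()


def heaps_curve(tokens, step=1):
    """Return (tokens_seen, types_seen) pairs for growing corpus prefixes."""
    seen = set()
    points = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if n % step == 0:
            points.append((n, len(seen)))
    return points


def fit_loglog(xs, ys):
    """Least-squares fit of log(y) = a + b * log(x); returns (a, b).

    Zipf's law, f(r) ~ C / r**s, predicts a slope b close to -s when
    x is rank and y is frequency. Heaps' law, V(N) ~ K * N**beta,
    predicts a slope b close to beta when x is tokens and y is types.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    sxx = sum((u - mean_x) ** 2 for u in lx)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Toy input only; a sample this small says nothing about either law.
sample = "यह एक घर है। वह एक घर है।"
tokens = tokenize(sample)
ranked = rank_frequency(tokens)
ranks = range(1, len(ranked) + 1)
freqs = [count for _, count in ranked]
zipf_intercept, zipf_slope = fit_loglog(ranks, freqs)
```

On a real corpus, the same `fit_loglog` call applied to the output of `heaps_curve` (with a larger `step`) would give an estimate of the Heaps exponent. An explicit character range is used instead of `\w` because Python's `re` module does not treat Devanagari vowel signs as word characters, which would split words at every matra.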
