Large Scale Quantitative Analysis of Three Indo-Aryan Languages

Parth Mehta, Prasenjit Majumder

doi:10.1080/09296174.2015.1071151

Abstract

In this paper, we present a thorough quantitative analysis of large-scale media text in three Indo-Aryan languages: Hindi, Gujarati and Bengali. Together these languages have 600 million speakers. Understanding and processing media text is important from sociological, cultural and information science/theoretic standpoints. We carried out a detailed study to understand the statistical nature of these data. The study demonstrates the effect of the size and category of media text on term distributions. We establish that while higher-order n-grams tend to follow Zipf’s law, the same is not always true for unigrams. We attempt to model the change in term distribution in two separate parts: the effect on the steepness of the term distribution and the effect on its tail. To the best of our knowledge, this is the first exploratory study of these three languages on such a large scale.
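
As an illustration of the kind of check the abstract refers to, the sketch below counts n-grams in a tokenised text and estimates the slope of the rank-frequency curve on a log-log scale. Under Zipf's law, frequency is roughly proportional to 1/rank^s, so the fitted slope is close to -s. This is a minimal sketch and not the authors' procedure: the function names `ngram_counts` and `zipf_slope` are hypothetical, whitespace tokenisation is assumed, and an ordinary least-squares fit is only a rough estimator of the exponent.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count contiguous n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def zipf_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    `counts` maps each term (or n-gram) to its frequency. A slope near -1
    corresponds to the classic form of Zipf's law.
    """
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct terms to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Toy input only; a real study would read a large corpus instead.
    text = "the cat sat on the mat and the dog sat on the rug"
    tokens = text.split()
    for n in (1, 2):
        print(n, zipf_slope(ngram_counts(tokens, n)))
```

Running the same fit separately for unigrams, bigrams and higher orders, and for corpora of different sizes or categories, gives one slope per setting that can then be compared. A single slope summarises only the steepness of the curve; it says little about the tail.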

Full Text