Abstract

This work proposes a state-of-the-art technique for examining the composition patterns and topological structure of the Urdu language. The improved method explores Urdu text as a co-occurrence network graph within the framework of complex network theory. For the first time, Urdu text is successfully transformed into a graph, despite the difficulties of handling the Nastalik script, the unavailability of resources, and limited support from language processing tools. We constructed an open, unannotated corpus of more than 3 million words using a random forest approach. An undirected, unweighted graph is created from the co-occurrence network of Urdu in Python 3.4. The resulting network, built with a bag-of-bigrams model, consists of 5180 nodes and 101415 edges. Deep statistical analysis of the graph is performed in the graph visualization tool Gephi 0.9.2. Furthermore, a null model of similar size, following the Erdős-Rényi random graph model, is generated for comparison with the Urdu network. The comparison is based on the average path length, clustering coefficient, and hierarchy of both networks. Analysis of these key features shows that the Urdu network graph differs from the random network. Its smaller average path length and high clustering coefficient also confirm the small-world effect in the Urdu language. Additionally, 11 communities are detected in the Urdu network, unlike the random network, where only one community exists. The statistics reveal that the Urdu network is a scale-free network with a layered composition pattern. The small-world effect and scale-free behavior of Urdu mark it as a complex network with a paradigmatic hierarchy in terms of authority distribution among words.
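The pipeline summarized in the abstract (bigram co-occurrence graph, Erdős-Rényi null model, comparison of clustering coefficient and average path length) can be illustrated with a minimal sketch. This is not the authors' code: the abstract states only that Python 3.4 and Gephi 0.9.2 were used, so every function name below is hypothetical, the token list is a toy placeholder rather than corpus data, and the null model is assumed to be the G(n, m) variant with the same node and edge counts. Only the Python standard library is used.

```python
import random
from collections import deque
from itertools import combinations


def build_cooccurrence_graph(tokens):
    """Undirected, unweighted graph with one edge per distinct adjacent word pair."""
    adj = {}
    for a, b in zip(tokens, tokens[1:]):
        adj.setdefault(a, set())
        adj.setdefault(b, set())
        if a != b:  # skip self-loops from repeated words
            adj[a].add(b)
            adj[b].add(a)
    return adj


def edge_count(adj):
    # Each undirected edge appears in two adjacency sets.
    return sum(len(nbrs) for nbrs in adj.values()) // 2


def average_clustering(adj):
    """Mean local clustering coefficient; nodes of degree < 2 contribute 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj) if adj else 0.0


def average_path_length(adj):
    """Mean BFS distance over all ordered pairs of mutually reachable nodes."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def erdos_renyi_gnm(nodes, m, seed=0):
    """Assumed null model: G(n, m), i.e. m distinct edges drawn uniformly at random."""
    nodes = list(nodes)
    if m > len(nodes) * (len(nodes) - 1) // 2:
        raise ValueError("m exceeds the number of possible edges")
    rng = random.Random(seed)
    adj = {n: set() for n in nodes}
    edges = set()
    while len(edges) < m:
        u, v = rng.sample(nodes, 2)  # two distinct endpoints
        edges.add(frozenset((u, v)))
    for u, v in map(tuple, edges):
        adj[u].add(v)
        adj[v].add(u)
    return adj


# Toy token stream standing in for tokenized text (placeholder, not corpus data).
tokens = "a b c a c d b d e a".split()
word_net = build_cooccurrence_graph(tokens)
null_net = erdos_renyi_gnm(word_net, edge_count(word_net))

for name, net in (("word network", word_net), ("random network", null_net)):
    print(name, len(net), edge_count(net),
          round(average_clustering(net), 3), round(average_path_length(net), 3))
```

On a real corpus, a small-world reading would correspond to the word network showing a clearly higher clustering coefficient than the null model at a comparably short average path length. The all-pairs BFS here is quadratic in the number of nodes, which is workable for a graph of a few thousand nodes but is a design choice made for clarity rather than speed.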

