Abstract

This paper compares the Hyderabad Telugu Treebank (HTTB) and the HCU-IIIT-H Telugu Treebank (HCU-IIIT-H TTB) from a statistical point of view. HTTB has 2,715 annotated sentences and HCU-IIIT-H TTB has 3,222 annotated sentences. Both treebanks were annotated following the Paninian Grammar Formalism proposed by Bharati, A.; Sharma, D.M.; Husain, S.; Bai, L.; Begam, R. and Sangal, R. (2009). HTTB is an inter-chunk-based treebank, whereas HCU-IIIT-H TTB is an intra-chunk-based treebank. The data size of each treebank is arbitrary. The paper then discusses the two Telugu treebanks in detail, focusing on the frequencies of POS, chunk and syntactic labels. The most frequent POS label is VM (3,807 occurrences) in HTTB and NN (5,486 occurrences) in HCU-IIIT-H TTB. NP (7,954 and 6,223 occurrences) is the most frequent phrasal category in both treebanks. In both treebanks, the most frequent k-labels are kartā (k1) (2,375-2,381 occurrences) and karma (k2) (1,408-1,437 occurrences), and the least frequent is karaṇa (k3) (17-39 occurrences). The most frequent non-k-labels are verb modifier (vmod) (949 occurrences) and noun modifier (nmod) (1,033 occurrences) in both treebanks. The statistical distribution shows the coverage of the kāraka and non-kāraka labels in both Telugu treebanks. The paper then compares the two treebanks and suggests reasons for the highest and lowest frequencies in each. k1 and k2 account for 60% of the coverage among kāraka labels, and vmod, nmod, adv, ccof and pof together account for 60% of the coverage among non-kāraka labels. A statistical study of this kind can help improve parser accuracy.
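
As an illustration of the coverage figures quoted above, the following is a minimal sketch of how label frequencies and their share of the total could be computed from a list of dependency labels. The function name `label_coverage` and the toy label sequence are hypothetical and are not taken from either treebank or from the authors' tooling.

```python
from collections import Counter


def label_coverage(labels):
    """Return (label, count, share) tuples, most frequent label first.

    `share` is the label's count divided by the total number of labels,
    i.e. the fraction of the annotation that the label covers.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, n, n / total) for label, n in counts.most_common()]


# Toy, made-up label sequence for illustration only (NOT treebank data).
toy_labels = ["k1", "k2", "k1", "k7p", "k1", "k2", "k3", "k7t", "k1", "k2"]

table = label_coverage(toy_labels)

# Combined coverage of the two most frequent labels.
top2_share = sum(share for _, _, share in table[:2])

for label, n, share in table:
    print(f"{label}\t{n}\t{share:.0%}")
print(f"top-2 coverage: {top2_share:.0%}")
```

Applied separately to the kāraka and non-kāraka labels of a treebank, the same computation would give the kind of coverage percentages reported in the abstract.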

