Abstract
This article presents an analysis of experiments with statistical and neural parsing techniques for Urdu, a widely spoken South Asian language. We demonstrate state-of-the-art constituency parsing results for an Urdu treebank. Urdu is morphologically rich and is characterized by free word order. Language representation (e.g. input type, lemmatization, word clusters), the part-of-speech (POS) tag set, phrase labels and the size of the training corpus are crucial for parsing such languages. In this article, probabilistic context-free grammars, data-oriented parsing, and recursive neural network based models are evaluated with several linguistic features that improve the parsing results. The features include syntactic sub-categorization of POS tags, empirically learned horizontal and vertical markovizations, and lexical head words. These features enable dependency information for case markers and add phrasal and lexical context to the parse trees. With gold POS tags in the test set, the data-oriented parsing and recursive neural network models both give an f-score of 87.1; on textual input, they achieve f-scores of 83.4 and 84.2, respectively. To overcome data sparsity caused by morphological richness, lemmatization and unsupervised word clustering have been performed. A treebank should cover the most probable word orders of the language so that models can learn the various orders accurately. To analyze the word order coverage of the treebank and the learning capability of different parsers, a test set has been prepared that is categorized by word order. The best performing parsing models are evaluated on this test set: with gold POS tags, f-scores are above 90, and on textual input the average f-score is 87.6.
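As an illustration of how vertical markovization adds phrasal context to parse trees, the following minimal sketch applies parent annotation, one common form of vertical markovization in which each phrasal label is extended with the label of its parent. The tuple-based tree encoding, the function name `parent_annotate`, the toy labels and the placeholder tokens `w1`, `w2` are assumptions made for this example; they are not taken from the treebank or from the configuration used in the article.

```python
def parent_annotate(tree, parent=None):
    """Return a copy of `tree` in which each phrasal label carries its parent label.

    A tree is either a string (a word) or a tuple (label, child, child, ...).
    This encoding is hypothetical and chosen only for illustration.
    """
    if isinstance(tree, str):
        return tree  # a word: nothing to annotate
    label, *children = tree
    # A preterminal is a POS tag directly above a single word; it is left
    # unannotated in this sketch so that only phrase labels are split.
    is_preterminal = len(children) == 1 and isinstance(children[0], str)
    if parent is None or is_preterminal:
        new_label = label
    else:
        new_label = f"{label}^{parent}"
    # Children are annotated with the original (unannotated) label of this node.
    return (new_label, *[parent_annotate(child, label) for child in children])


toy_tree = ("S", ("NP", ("NN", "w1")), ("VP", ("VB", "w2")))
annotated = parent_annotate(toy_tree)
```

After annotation, an NP under S and an NP under VP receive different labels, so a PCFG estimated from the annotated trees can assign them different expansion probabilities.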
Highlights
Urdu is a morphologically rich language which is written in a version of the Arabic script
We have experimented with probabilistic context-free grammars (PCFGs), data-oriented parsing (DOP), lexicalized grammars and a recursive neural network (RNN)
The first representation is plain Urdu text with surface forms. With this representation, the lexicalized PCFG parser achieved an f-score of 86.1 with gold part-of-speech (POS) tags and 83.2 with predicted tags, at a tagging accuracy of 95.7%
Summary
Urdu is a morphologically rich language which is written in a version of the Arabic script. Data-oriented and RNN parsing models performed best on the CLE-UTB. Both parsers give an f-score of 87.1 with gold POS tags, while they produce f-scores of 83.4 and 84.2 on textual input. Statistical parsing was also evaluated after replacing the words with their cluster labels, obtained with the predictive exchange word clustering algorithm discussed in [4]. This improved lexicalized parsing f-scores by 0.8 and 1.2 on gold POS and textual input, respectively. Urdu has flexible word order. The best performing DOP and RNN parsing models were evaluated against a test set which we categorized by word order. This test set contains sentences with the most probable subject-object-verb (SOV), object-subject-verb (OSV) and subject-verb (SV) word orders.
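To make the reported f-scores concrete, the sketch below computes a labeled bracket f-score in the PARSEVAL style: the harmonic mean of precision and recall over labeled constituent spans. The source does not state which evaluation tool or settings were used, so this is an assumption about the general metric; standard tools apply further conventions that are omitted here. The tree encoding, the function names and the placeholder tokens are hypothetical.

```python
from collections import Counter


def labeled_spans(tree, start=0):
    """Return (end_position, Counter of (label, start, end)) for phrasal nodes.

    A tree is either a string (a word) or a tuple (label, child, child, ...).
    Preterminals (a POS tag over a single word) are not counted as brackets.
    """
    if isinstance(tree, str):
        return start + 1, Counter()
    label, *children = tree
    spans = Counter()
    pos = start
    for child in children:
        pos, child_spans = labeled_spans(child, pos)
        spans += child_spans
    is_preterminal = len(children) == 1 and isinstance(children[0], str)
    if not is_preterminal:
        spans[(label, start, pos)] += 1
    return pos, spans


def bracket_f1(gold, pred):
    """Labeled bracket F1 between a gold tree and a predicted tree."""
    _, gold_spans = labeled_spans(gold)
    _, pred_spans = labeled_spans(pred)
    matched = sum((gold_spans & pred_spans).values())  # multiset intersection
    precision = matched / sum(pred_spans.values()) if pred_spans else 0.0
    recall = matched / sum(gold_spans.values()) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ("S", ("NP", ("NN", "w1")), ("VP", ("NP", ("NN", "w2")), ("VB", "w3")))
pred = ("S", ("NP", ("NN", "w1")), ("NP", ("NN", "w2")), ("VP", ("VB", "w3")))
score = bracket_f1(gold, pred)
```

In this toy pair the predicted tree attaches the second NP to S instead of VP, so three of the four brackets match on each side. Scores in the abstract are on a 0 to 100 scale, which corresponds to multiplying this value by 100.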