Analysis of Experiments on Statistical and Neural Parsing for a Morphologically Rich and Free Word Order Language Urdu

Toqeer Ehsan, Sarmad Hussain

doi:10.1109/access.2019.2949950

Open Access

Journal: IEEE Access
Publication Date: Jan 1, 2019
Citations: 7
License type: CC BY 4.0

Affiliation: University of Engineering and Technology Lahore

Abstract

This article presents an analysis of experiments with statistical and neural parsing techniques for Urdu, a widely spoken South Asian language. We demonstrate state-of-the-art constituency parsing results for an Urdu treebank. Urdu is morphologically rich and is characterized by free word order. Language representation (e.g., input type, lemmatization, word clusters), the part-of-speech (POS) tag set, phrase labels and the size of the training corpus are crucial for parsing such languages. In this article, probabilistic context-free grammars, data-oriented parsing, and recursive neural network based models are evaluated with several linguistic features that improve the parsing results. The features include syntactic sub-categorization of POS tags, empirically learned horizontal and vertical markovizations, and lexical head words. These features enable dependency information for case markers and add phrasal and lexical context to the parse trees. The data-oriented parsing and recursive neural network models each give an f-score of 87.1 with gold POS tags in the test set; on textual input, they achieve f-scores of 83.4 and 84.2, respectively. To overcome data sparsity caused by morphological richness, lemmatization and unsupervised word clustering have been performed. A treebank should cover the most probable word orders of the language so that models can learn the various orders accurately. To analyze the order coverage of the treebank and the learning capability of different parsers, a test set has been prepared that is conditioned on different word orders. This test set is evaluated with the best performing parsing models: with gold POS tags, f-scores are above 90, and on textual input, the average f-score is 87.6.
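As a rough illustration of the vertical markovization mentioned in the abstract, the sketch below applies first-order parent annotation to a toy constituency tree, so that each phrasal label also records the label of its parent. The tree encoding, the function name `parent_annotate`, the labels and the placeholder tokens are assumptions made for this example; they are not taken from the article or from the CLE-UTB annotation scheme.

```python
def parent_annotate(tree, parent=None):
    """Return a copy of `tree` whose phrasal labels carry their parent label.

    A tree is either a string (a word) or a (label, children) tuple.
    This is an illustrative encoding, not the treebank's own format.
    """
    if isinstance(tree, str):
        return tree  # words are left unchanged
    label, children = tree
    # A preterminal is a POS tag directly above a single word; it is left
    # unannotated in this sketch so that the tag set stays the same.
    is_preterminal = len(children) == 1 and isinstance(children[0], str)
    if parent is None or is_preterminal:
        new_label = label
    else:
        new_label = f"{label}^{parent}"
    # Children see the ORIGINAL label of this node as their parent.
    return (new_label, [parent_annotate(child, label) for child in children])


# Hypothetical toy tree with placeholder tokens w1 and w2.
toy = ("S", [("NP", [("NN", ["w1"])]), ("VP", [("VB", ["w2"])])])
annotated = parent_annotate(toy)
```

With this encoding, the `NP` and `VP` nodes under `S` become `NP^S` and `VP^S`, which lets a PCFG estimate separate rule probabilities for the same phrase type in different contexts.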

Highlights

Urdu is a morphologically rich language written in a version of the Arabic script.
We have experimented with probabilistic context-free grammars (PCFGs), data-oriented parsing (DOP), lexicalized grammars and a recursive neural network (RNN).
The first representation is plain Urdu text with surface forms. With this representation, the lexicalized PCFG parser achieved an f-score of 86.1 with gold part-of-speech (POS) tags and 83.2 with predicted tags, at a tagging accuracy of 95.7%.
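The f-scores quoted above can be read as the harmonic mean of bracket precision and recall. The following minimal sketch assumes labeled-bracket scoring over `(label, start, end)` spans, which is the usual convention for constituency parsing; this page does not state the exact evaluation setup, so the function `bracket_f1` and the toy spans are illustrative assumptions only.

```python
from collections import Counter


def bracket_f1(gold, pred):
    """F1 over labeled spans, each given as a (label, start, end) tuple."""
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    # Multiset intersection: a span counts as matched at most as often
    # as it occurs in both the gold and the predicted analysis.
    matched = sum((gold_counts & pred_counts).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(pred_counts.values())
    recall = matched / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical spans for a four-word sentence; three of four brackets agree.
gold_spans = [("S", 0, 4), ("NP", 0, 2), ("VP", 2, 4), ("NP", 3, 4)]
pred_spans = [("S", 0, 4), ("NP", 0, 2), ("VP", 2, 4), ("NP", 2, 3)]
score = bracket_f1(gold_spans, pred_spans)
```

In this toy case precision and recall are both 3/4, so the F1 is 0.75; reported scores such as 86.1 correspond to this value multiplied by 100 and aggregated over a test set.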

Summary

INTRODUCTION

Urdu is a morphologically rich language written in a version of the Arabic script. The data-oriented and RNN parsing models performed best on the CLE-UTB. Both parsers give an f-score of 87.1 with gold POS tags, while they produce f-scores of 83.4 and 84.2, respectively, on textual input. Statistical parsing was also evaluated after replacing the words with their cluster labels, obtained with the predictive exchange word clustering algorithm as discussed in [4]. This improved lexicalized parsing f-scores by 0.8 and 1.2 on gold POS and textual input, respectively. Because Urdu has flexible word order, the best performing DOP and RNN parsing models were evaluated against a test set that we categorized by word order. This test set contains sentences with the most probable subject-object-verb (SOV), object-subject-verb (OSV) and subject-verb (SV) word orders.
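The idea of replacing words with cluster labels can be sketched as a simple lookup applied before parsing. The example below does not implement the predictive exchange clustering algorithm itself; it assumes a word-to-cluster mapping has already been produced by some clustering tool. The mapping, the cluster names, the romanized tokens and the `C_UNK` fallback label are all hypothetical.

```python
def to_cluster_labels(tokens, word_to_cluster, unknown="C_UNK"):
    """Replace each token by its cluster label, or by `unknown` if unseen."""
    return [word_to_cluster.get(token, unknown) for token in tokens]


# Hypothetical mapping: inflected variants of a word share one cluster,
# so the parser sees a single, more frequent symbol instead of rare forms.
word_to_cluster = {
    "kitab": "C12",
    "kitabein": "C12",
    "parhi": "C40",
}

labels = to_cluster_labels(["kitab", "kitabein", "parhi", "xyz"], word_to_cluster)
```

Mapping several surface forms to one label is how such a representation can reduce data sparsity for a lexicalized grammar, at the cost of discarding distinctions between words in the same cluster.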

BACKGROUND

GRAMMAR FORMALISMS AND PARSING FEATURES

RECURSIVE NEURAL NETWORK BASED PARSER

LEXICALIZED PCFG

Findings

DISCUSSION

CONCLUSION
