Abstract

This paper presents an open-source and extendable Morphological Analyser cum Generator (MAG) for Tamil, named ThamizhiMorph. Tamil is a low-resource language in terms of NLP tools and applications. In addition, most of the available tools are neither open nor extendable. A morphological analyser is a key resource for the storage and retrieval of morphophonological and morphosyntactic information, especially for morphologically rich languages, and is also useful for developing applications within Machine Translation. This paper describes how ThamizhiMorph is designed using a Finite-State Transducer (FST) and implemented using Foma. We discuss our design decisions based on the peculiarities of Tamil and its nominal and verbal paradigms. We specify a high-level meta-language to efficiently characterise the language’s inflectional morphology. We evaluate ThamizhiMorph using text from a Tamil textbook and the Tamil Universal Dependency treebank version 2.5. The evaluation and error analysis attest a very high performance level, with the identified errors being mostly due to out-of-vocabulary items, which are easily fixable. In order to foster further development, we have made our scripts, the FST models, lexicons, Meta-Morphological rules, lists of generated verbs and nouns, and test data sets freely available for others to use and extend.

Highlights

  • The Web contains a large and rapidly-growing textual volume of Tamil, a Southern Dravidian language of South Asia. Several organisations and individuals are working on Tamil language computing.

  • We are in the process of developing a Parallel Grammar (ParGram) style computational grammar for Tamil, which requires a morphological analyser with good precision and coverage, implemented using a finite-state approach that interfaces with the grammar.

  • We have also developed a tool that writes ThamizhiMorph's morphological annotations in the CoNLL-U format, which is used for annotating Universal Dependencies (UD) treebanks.
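
As a rough illustration of the CoNLL-U output mentioned in the last highlight, the sketch below renders one analysed token as a CoNLL-U line. It is not ThamizhiMorph's actual tool: the function name `to_conllu_line`, its input layout, and the sample analysis of வீடுகள் are assumptions made for this example. Only the ten-column, tab-separated layout and the `Key=Value|Key=Value` form of the FEATS column come from the CoNLL-U format itself.

```python
# Hypothetical helper: the name and input layout are assumptions for
# illustration, not ThamizhiMorph's actual interface.

def to_conllu_line(token_id, form, lemma, upos, feats):
    """Render one analysed token as a 10-column, tab-separated CoNLL-U line."""
    # FEATS holds Key=Value pairs joined by '|' and sorted by feature name;
    # an empty feature set is written as '_'.
    feats_field = "|".join(
        f"{key}={value}"
        for key, value in sorted(feats.items(), key=lambda kv: kv[0].lower())
    ) or "_"
    columns = [
        str(token_id),   # ID
        form,            # FORM
        lemma,           # LEMMA
        upos,            # UPOS
        "_",             # XPOS (left unspecified in this sketch)
        feats_field,     # FEATS
        "_",             # HEAD
        "_",             # DEPREL
        "_",             # DEPS
        "_",             # MISC
    ]
    return "\t".join(columns)


# Illustrative analysis (assumed): the plural noun "வீடுகள்" with lemma "வீடு".
line = to_conllu_line(
    1, "வீடுகள்", "வீடு", "NOUN", {"Number": "Plur", "Case": "Nom"}
)
print(line)
```

The syntactic columns (HEAD, DEPREL, DEPS) are left as `_` here because a morphological analyser alone supplies no dependency information.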

Introduction

The Web contains a large and rapidly-growing textual volume of Tamil, a Southern Dravidian language of South Asia. Several organisations and individuals are working on Tamil language computing. Tamil is spoken natively by more than 80 million people across the world. It has been recognised as a classical language by the government of India, as it has more than 2000 years of continuous and unbroken literary tradition (Hart 2000). It is one of the official languages of Sri Lanka and Singapore, and has regional official status in Tamil Nadu and Pondicherry, India. It has been recognised as a minority or indigenous language in several countries, including Malaysia, Mauritius, and South Africa, and is taught there as a second language.
