The TypeCraft Natural Language Database: Annotating and Incorporating Urdu

Sharmin Muzaffar, Lars Hellan, Girish Nath Jha, Pitambar Behera, Dorothee Beermann

doi:10.17485/ijst/2015/v8i27/81728

Abstract

The authors present Urdu, one of the important Indo-Aryan languages, on the TypeCraft (TC) platform, an online, multilingual, corpus-based natural language database and documentation platform for natural languages. The platform had previously incorporated other Indian languages such as Telugu, Bengali, Hindi, and Odia, and has recently been extended to the annotation and incorporation of Urdu. The TC framework is designed to support linguistic annotation up to the level of semantics, which facilitates the cross-comparison of structures between languages of different families. The recent version, TC 2.2, extends annotation to discourse and pragmatics through a closer integration of text-level and sentence-level annotation. In principle the system is applicable to all languages, but in practice it is also very specific in how it encodes salient syntactic and semantic features. The paper highlights several linguistic issues: agreement, case, verbs, mood, labeling features, glossing, and technical challenges. The study focuses on the linguistic annotation of Urdu, drawing on the annotated data available on the platform.

Full Text