Abstract

Annotated text corpora are an important resource for Natural Language Processing (NLP) applications such as machine translation, information retrieval and text mining. Corpus annotation requires part-of-speech (POS) tags, which mark units of text with grammatical annotations in order to identify the linguistic properties of a word, sentence or discourse. Marking a text item relies on two main features: 1) its grammatical category and 2) its context (word, sentence or discourse), i.e. its relationship with adjacent and related text. Although Saraiki is one of the oldest languages, it remains resource-scarce both in recorded literature and in computational contexts. According to our study, no tagset has yet been defined for the Saraiki language. This work presents the first hierarchical POS tagset (MPOST) for the Saraiki language, designed for morphological, syntactic and lexical annotation of Saraiki corpora.

Highlights

  • Saraiki is an Indo-Aryan language spoken mainly in Southern Punjab, Pakistan; it is a minor language in India and the rest of the Indian subcontinent

  • We introduce a comprehensive tagset for the Saraiki language that can be used for part-of-speech (POS) tagging of Saraiki corpora

  • No tagset currently exists for the Saraiki language; as a first effort, this work presents a detailed hierarchical tagset

Summary

INTRODUCTION

Saraiki is an Indo-Aryan language spoken mainly in Southern Punjab, Pakistan; it is a minor language in India and the rest of the Indian subcontinent. There are around 26 million native speakers in Pakistan and India alone. It is written in the Perso-Arabic script and has its own alphabet of 45 letters. Its dialects include Multani (Main Saraiki), Thalli (Thal region), Rajanpur (Southern Saraiki), Rohi (Cholistan desert and adjacent areas), Thar (Thar desert and all of the Sindh region), Majhi and Shahpuri. It is a morphologically rich language with distinct tones and a well-structured sentence architecture. We review POS tagset development for Urdu and for local languages spoken in Pakistan and neighboring countries. These local languages are largely similar to one another, although their writing scripts can differ. Tagsets are normally divided into three categories: flat, hierarchical and fine-grained [30] [18].
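
To make the idea of a hierarchical tagset concrete, the following minimal Python sketch stores a tag as a path from a coarse category down to a fine one, so that a fine-grained tag can be collapsed to a coarser level when needed. The class name `HierTag`, its methods, and the label strings are assumptions made for illustration; the path loosely follows the "Noun Personal Proper Sur-name" heading in the outline and is not the tagset's actual tag codes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class HierTag:
    """A hierarchical POS tag stored as a path from coarse to fine (hypothetical)."""
    levels: Tuple[str, ...]

    def label(self) -> str:
        # Join the levels into one dotted label, e.g. "Noun.Personal".
        return ".".join(self.levels)

    def at_depth(self, depth: int) -> "HierTag":
        # Truncate the path to recover a coarser tag.
        return HierTag(self.levels[:depth])

    def is_a(self, other: "HierTag") -> bool:
        # True when `other` is this tag itself or one of its ancestors.
        return self.levels[:len(other.levels)] == other.levels


# Illustrative labels only; not taken from the published tagset.
surname = HierTag(("Noun", "Personal", "Proper", "Surname"))
noun = HierTag(("Noun",))

print(surname.label())              # full fine-grained label
print(surname.at_depth(2).label())  # collapsed to the second level
print(surname.is_a(noun))           # ancestor check
```

A flat tagset would, by contrast, treat every label as an unrelated atomic symbol; keeping the path explicit is what lets an annotator or tagger back off to a coarser category.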

RELATED WORK
TAGSET DESIGN METHODOLOGY
Noun Personal Proper Sur-name
SARAIKI HIERARCHICAL TAGSET
Noun and its sub-categories
CONCLUSION AND FUTURE WORK