A Novel Parts of Speech (POS) Tagset for morphological, syntactic and lexical annotations of Saraiki language

Muhammad Nabeel Asghar

doi:10.36785/jaes.111459

Abstract

Annotated text corpora are an important resource for Natural Language Processing (NLP) applications such as machine translation, information retrieval and text mining. Corpus annotation requires parts of speech (POS) tags that mark parts of the text with grammatical annotations in order to identify the linguistic properties of a word, sentence or discourse. The marking of text items is based on two main features: 1) grammatical category and 2) context of the text (word, sentence or discourse), i.e. its relationship with adjacent and related text. Saraiki, although one of the oldest languages, is still resource-scarce both in recorded literature and in computational contexts. According to our study, no tagset has yet been defined for the Saraiki language. This work presents the first hierarchical POS tag set (MPOST) for the Saraiki language, designed for morphological, syntactic and lexical annotation of Saraiki language corpora.

Highlights

Saraiki is an Indo-Aryan language that is spoken mainly in Southern Punjab, Pakistan, and is a minority language in India and the rest of the Indian sub-continent
We introduce a comprehensive tagset for the Saraiki language that can be used for Parts of Speech (POS) tagging of Saraiki language corpora
No tagset currently exists for the Saraiki language; as a first effort, this work presents a detailed hierarchical tag set

Summary

INTRODUCTION

Saraiki is an Indo-Aryan language that is spoken mainly in Southern Punjab, Pakistan, and is a minority language in India and the rest of the Indian sub-continent. There are around 26 million native speakers in Pakistan and India alone.¹ It is written in Perso-Arabic script and has its own alphabet of 45 letters. Its dialects include Multani (Main Saraiki), Thalli (Thal region), Rajanpur (Southern Saraiki), Rohi (Cholistan desert and adjacent areas), Thar (Thar desert and all Sindh region), Majhi and Shahpuri. It is a morphologically rich language with different tones and a well-structured sentence architecture. We review POS tag set development for Urdu and other local languages spoken in Pakistan and neighboring countries. These local languages are mostly similar to each other, although their writing scripts can differ. Tag sets are normally divided into three categories: flat tag sets, hierarchical tag sets and fine-grained tag sets [30] [18].
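
The relationship between hierarchical and flat tag sets can be illustrated with a minimal Python sketch. It treats a hierarchical tag as a root-to-leaf path and derives a coarser tag by truncating that path. The path used here comes from the category chain "Noun Personal Proper Sur-name" listed under the tagset design methodology; reading that chain as a path, the `>` separator, and the names `coarsen` and `label` are assumptions made for illustration, not part of the published tagset.

```python
from typing import Tuple

TagPath = Tuple[str, ...]

# Category chain taken from the summary; treating it as a root-to-leaf
# path is an assumption made for this illustration.
SURNAME: TagPath = ("Noun", "Personal", "Proper", "Sur-name")


def coarsen(tag: TagPath, depth: int) -> TagPath:
    """Keep only the first `depth` levels of a hierarchical tag."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return tag[:depth]


def label(tag: TagPath) -> str:
    """Render a tag path as a single string (hypothetical '>' separator)."""
    return ">".join(tag)


print(label(coarsen(SURNAME, 1)))  # coarse, flat-style tag
print(label(coarsen(SURNAME, 4)))  # full hierarchical tag
```

Under this reading, an annotator or tool can work at whichever level of detail a task needs, while a flat tag set corresponds to fixing the depth in advance.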

RELATED WORK

TAGSET DESIGN METHODOLOGY

Noun Personal Proper Sur-name
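
As a rough sketch of how such a category chain could be stored, the Python snippet below builds a nested dictionary from tag paths and renders it as an indented outline. Only the chain Noun, Personal, Proper, Sur-name comes from the text above; the sibling `First-name` node and the helper names `build_tree` and `render` are hypothetical and are included only to show branching.

```python
from typing import Dict, Iterable, List, Sequence

Tree = Dict[str, "Tree"]


def build_tree(paths: Iterable[Sequence[str]]) -> Tree:
    """Merge root-to-leaf tag paths into one nested dictionary."""
    root: Tree = {}
    for path in paths:
        node = root
        for name in path:
            # Reuse an existing child or create an empty one.
            node = node.setdefault(name, {})
    return root


def render(tree: Tree, indent: int = 0) -> List[str]:
    """Return the tree as indented lines, one category per line."""
    lines: List[str] = []
    for name, children in tree.items():
        lines.append("  " * indent + name)
        lines.extend(render(children, indent + 1))
    return lines


paths = [
    ("Noun", "Personal", "Proper", "Sur-name"),    # chain from the text
    ("Noun", "Personal", "Proper", "First-name"),  # hypothetical sibling
]
tree = build_tree(paths)
print("\n".join(render(tree)))
```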

SARAIKI HIERARCHICAL TAGSET

Noun and its sub-categories

CONCLUSION AND FUTURE WORK


Journal: Journal of Applied and Emerging Sciences	Publication Date: Jun 8, 2021
Citations: 1	License type: cc-by

Similar Papers

A Novel Parts of Speech (POS) Tagset for morphological, syntactic and lexical annotations of Saraiki language
Muhammad Nabeel Asghar
Journal of Applied and Emerging Sciences | VOL. 11
08 Jun 2021

Parts of Speech Tagging in Bengali for MWEs Detection
Bipul Syam Purkayastha ... Md Jaynalabedin
International Journal of Computer Applications | VOL. 99
20 Aug 2014

Part of Speech Tagging for Setswana African Language
M.A Dibitso ... P A Owolawi
01 Nov 2019

Hidden Markov Model based Part of Speech Tagging for Nepali language
Abhijit Paul ... Bipul Syam Purkayastha
01 Sep 2015
