Developing a POS Tagged Corpus of Urdu Tweets

Amber Baig, Ahsanullah Baloch, Mutee U Rahman, Hameedullah Kazi

doi:10.3390/computers9040090

Abstract

Processing social media text such as tweets is challenging for traditional Natural Language Processing (NLP) tools, which were developed for well-edited text, because such text is noisy. Demand for tools and resources that process noisy text correctly has nevertheless grown in recent years, since this text is useful in many applications. The literature reports efforts to build such tools and resources for several languages, notably for part-of-speech (POS) tagging, an NLP task that directly affects the performance of subsequent text processing steps. Still, no attempt has been made to develop a POS tagger for Urdu social media content. The focus of this paper is therefore POS tagging of Urdu tweets. We introduce a new tagset for POS tagging of Urdu tweets along with a POS-tagged corpus of Urdu tweets. We also investigate bootstrapping as a potential solution to the shortage of manually annotated data and present a supervised POS tagger that achieves 93.8% precision, 92.9% recall and 93.3% F-measure.
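The three figures reported above can be checked against each other. The short sketch below assumes that "F-measure" means the balanced F1 score (the harmonic mean of precision and recall); the abstract does not say which variant was used, so this is an assumption, and the function name `f_measure` is purely illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: the paper's F-measure is the balanced F1 variant.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract, expressed as fractions.
reported_precision = 0.938
reported_recall = 0.929

f1 = f_measure(reported_precision, reported_recall)
print(f"F-measure: {f1:.1%}")  # prints "F-measure: 93.3%"
```

Under that assumption the computed value agrees with the reported 93.3%, which suggests the three numbers are mutually consistent.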

Highlights

Recent years have witnessed immense popularity of social media platforms among Internet users, researchers and organizations from several domains
Tweet mentions of the form “@mshaanshahid” are tagged as PROPN as these usernames represent a real person in the social media world
A new POS-tagged dataset constructed from Urdu tweets is presented along with its tagging scheme, thereby expanding Urdu language Natural Language Processing (NLP) research for the processing of Urdu social media text
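The mention rule in the highlights can be illustrated with a small rule-based pre-tagging step. This is a minimal sketch, not the authors' implementation: the regular expression for what counts as a mention, the function name `pretag_mentions`, and the example tweet are all assumptions made for illustration.

```python
import re

# Assumption: a mention is "@" followed by one or more letters, digits
# or underscores. Twitter's real username rules are stricter than this.
MENTION_RE = re.compile(r"^@\w+$")


def pretag_mentions(tokens):
    """Return (token, tag) pairs; mentions get PROPN, other tokens get None.

    Tokens tagged None are left for a downstream tagger to label.
    """
    return [(tok, "PROPN" if MENTION_RE.match(tok) else None) for tok in tokens]


# Made-up example tweet: a mention followed by one Urdu word.
tokens = "@mshaanshahid شکریہ".split()
for token, tag in pretag_mentions(tokens):
    print(token, tag)
```

Handling mentions with a deterministic rule like this keeps a closed, easily recognized class out of the statistical model's way; whether the original tagger does so is not stated in the text above.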

Summary

Introduction

Recent years have witnessed immense popularity of social media platforms among Internet users, researchers and organizations from several domains. Twitter allows its users to write or share tweets of up to 280 characters on countless topics, such as opinions about aspects of life, reviews of products, films and games, and discussions of relationship issues, government affairs, pandemics, etc. Language on Twitter is quite different from the well-edited text of news, books, etc., due to unconventional orthography, punctuation and grammatical mistakes, along with Twitter-specific conventions such as hashtags, emoticons, usernames and retweet tokens [2]. Such language style variation is often characterized as noisy user-generated text [3]. The current study is undertaken to design and implement a POS tagging model for Urdu tweets using statistical data-driven methods.
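The abstract mentions bootstrapping as a way around the shortage of manually annotated data. One common reading of that term is self-training: train on a small seed set, tag unlabeled sentences, and add only the confidently tagged ones back to the training data. The sketch below follows that reading; it is not the authors' procedure, and the toy most-frequent-tag model, the confidence threshold, the default tag and all function names are assumptions chosen to keep the example self-contained.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Toy stand-in tagger: most frequent tag per word, with its relative frequency."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    model = {}
    for word, tag_counts in counts.items():
        tag, n = tag_counts.most_common(1)[0]
        model[word] = (tag, n / sum(tag_counts.values()))
    return model


def tag_sentence(model, words, default_tag="NOUN"):
    """Return (word, tag, confidence); unknown words get default_tag at confidence 0."""
    return [(w, *model.get(w, (default_tag, 0.0))) for w in words]


def bootstrap(seed, unlabeled, rounds=3, threshold=0.9):
    """Self-training loop: grow the labeled set with confidently tagged sentences."""
    labeled = list(seed)
    pool = list(unlabeled)
    for _ in range(rounds):
        model = train_unigram(labeled)
        remaining = []
        for words in pool:
            tagged = tag_sentence(model, words)
            # Accept a sentence only if every token clears the threshold.
            if words and min(conf for _, _, conf in tagged) >= threshold:
                labeled.append([(w, t) for w, t, _ in tagged])
            else:
                remaining.append(words)
        if len(remaining) == len(pool):
            break  # nothing new was accepted; stop early
        pool = remaining
    return train_unigram(labeled), labeled, pool


# Placeholder tokens stand in for real tweet words.
seed = [[("w1", "NOUN"), ("w2", "AUX")]]
unlabeled = [["w1", "w2"], ["w3"]]
model, labeled, leftover = bootstrap(seed, unlabeled)
print(len(labeled), leftover)
```

A unigram model cannot generalize to unseen words, so in practice the stand-in would be replaced by a sequence model that uses context; the loop structure stays the same.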

Related Work

Urdu Tweet Part-of-Speech Tagset

Evaluation Metrics

ADJ: Adjective

ADP: Adposition

ADV: Adverb

AUX: Auxiliary

CONJ: Coordinating Conjunction

DET: Determiner

INTJ: Interjection

NOUN: Noun

NUM: Numeral

PRON: Pronoun

PROPN: Proper Noun

RET: Retweet

X: Others

Dataset for Corpus Creation

Pre-Processing

Manual Annotation

Bootstrapping

Discussion

Evaluation

Major and their their

Conclusions

AUX: Auxiliary Verb

Urdu Tweet Corpus

Journal: Computers
Publication Date: Nov 7, 2020
Citations: 5
License type: CC BY 4.0
