Abstract

Processing social media text such as tweets is challenging for traditional Natural Language Processing (NLP) tools, which were developed for well-edited text, because such text is noisy. However, demand for tools and resources that process noisy text correctly has grown in recent years because of its usefulness in various applications. The literature reports various efforts to develop tools and resources for processing noisy text in various languages, notably for part-of-speech (POS) tagging, an NLP task that directly affects the performance of subsequent text processing activities. Still, no attempt has been made to develop a POS tagger for Urdu social media content. This paper therefore focuses on POS tagging of Urdu tweets. We introduce a new tagset for POS tagging of Urdu tweets along with a POS-tagged corpus of Urdu tweets. We also investigate bootstrapping as a potential solution to the shortage of manually annotated data and present a supervised POS tagger that achieves 93.8% precision, 92.9% recall and 93.3% F-measure.
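
The three reported scores can be related with a short calculation. The sketch below assumes that the F-measure is the balanced F1 score, i.e. the harmonic mean of precision and recall; the abstract does not state the formula, and the function name `f_measure` is illustrative only. Under that assumption, the reported precision and recall reproduce the reported F-measure.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: the abstract's F-measure is the balanced F1 score.
    """
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) as reported in the abstract.
print(round(f_measure(93.8, 92.9), 1))  # 93.3
```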

Highlights

  • Recent years have witnessed immense popularity of social media platforms among Internet users, researchers and organizations from several domains

  • Tweet mentions of the form “@mshaanshahid” are tagged as PROPN as these usernames represent a real person in the social media world

  • A new POS-tagged dataset constructed from Urdu tweets is presented along with its tagging scheme, thereby expanding Urdu language Natural Language Processing (NLP) research for the processing of Urdu social media text


Summary

Introduction

Recent years have witnessed immense popularity of social media platforms among Internet users, researchers and organizations from several domains. Twitter allows its users to write or share tweets of up to 280 characters on countless topics, such as opinions about aspects of life, reviews of products, films and games, and discussions of relationship issues, government affairs, pandemics, etc. Language on Twitter is quite different from the well-edited text of news, books, etc., due to unconventional orthography, punctuation and grammatical mistakes, along with Twitter-specific conventions such as hashtags, emoticons, usernames and retweet tokens [2]. Such variation in language style is often characterized as noisy user-generated text [3]. The current study designs and implements a POS tagging model for Urdu tweets using statistical, data-driven methods.
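
To illustrate how Twitter-specific tokens differ from ordinary words, the following minimal sketch assigns tags to two such token types before any statistical tagging. The PROPN tag for mentions and the RET tag for retweets come from the tagset described in this paper; the regular expressions, the literal `RT` retweet token and the function name `pre_tag` are assumptions made for illustration and are not the authors' implementation.

```python
import re

# Assumed token shapes (hypothetical, for illustration only).
MENTION = re.compile(r"^@\w+$")  # e.g. "@mshaanshahid"
RETWEET = re.compile(r"^RT$")    # assumed literal retweet marker


def pre_tag(tokens):
    """Tag Twitter-specific tokens; leave all other tokens untagged (None)."""
    tagged = []
    for tok in tokens:
        if MENTION.match(tok):
            tagged.append((tok, "PROPN"))  # usernames denote real persons
        elif RETWEET.match(tok):
            tagged.append((tok, "RET"))
        else:
            tagged.append((tok, None))     # left for the statistical tagger
    return tagged


print(pre_tag(["RT", "@mshaanshahid", "hello"]))
```

Tokens left as `None` would be passed on to the trained tagger; only the unambiguous, convention-driven tokens are handled by rules in this sketch.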

Related Work
Urdu Tweet Part-of-Speech Tagset
Evaluation Metrics
ADJ: Adjective
ADP: Adposition
ADV: Adverb
AUX: Auxiliary
CONJ: Coordinating Conjunction
DET: Determiner
INTJ: Interjection
NOUN: Noun
NUM: Numeral
PRON: Pronoun
PROPN: Proper Noun
RET: Retweet
X: Others
Dataset for Corpus Creation
Pre-Processing
Manual Annotation
Bootstrapping
Discussion
Evaluation
Major and their
Conclusions
AUX: Auxiliary Verb
Urdu Tweet Corpus