Abstract

The Internet has seen substantial growth in regional-language data in recent years, which lets people express their opinions without language barriers. Urdu is used by 170.2 million people for communication. Sentiment analysis is used to gain insight into people's opinions. In recent years, researchers' interest in Urdu sentiment analysis has grown, but the application of deep learning methods to it remains largely unexplored. Because Urdu is a morphologically rich language, much ground remains to be covered in Urdu text processing. In this paper, we propose a framework for Urdu Text Sentiment Analysis (UTSA) that explores deep learning techniques in combination with various word vector representations. We evaluate the performance of Long Short-Term Memory (LSTM), attention-based Bidirectional LSTM (BiLSTM-ATT), Convolutional Neural Network (CNN), and CNN-LSTM (C-LSTM) models for sentiment analysis. Stacked layers are applied in the sequential models LSTM, BiLSTM-ATT, and C-LSTM. In the CNN, various filters are used with a single convolution layer. The role of pre-trained and unsupervised self-trained embedding models in the sentiment classification task is also investigated. The results show that BiLSTM-ATT outperformed the other deep learning models, achieving 77.9% accuracy and a 72.7% F1 score.

Highlights

  • Social media forums, blogs, comments, and reviews provide opinionated data about issues, products, and services

  • We propose a framework in which various Deep Learning (DL) techniques, such as stacked Long Short-Term Memory (LSTM), attention-based Bidirectional LSTM (BiLSTM-ATT), Convolutional Neural Networks (CNN), and C-LSTM, are explored for sentiment classification

  • The rule-based classifier performed better than Bag of Words (BOW), and the model based on Machine Learning (ML) techniques trained with discourse features performed significantly better than the model trained without discourse features


Summary

Introduction

Social media forums, blogs, comments, and reviews provide opinionated data about issues, products, and services. During the recent pandemic period, a sudden surge in internet usage was reported[1]. According to Statista[2], there were 4.66 billion active internet users as of October 2020. Increased internet usage has encouraged its transformation from a monolingual to a multilingual platform. The number of websites in different languages, including Urdu, has increased substantially. Urdu is an official language of Pakistan and one of India's scheduled languages, used by millions of people worldwide for communication. The most visited sites in Pakistan offer their content in Urdu[3]. Urdu presents some challenges for language processing: it uses formal and informal verb forms, and each noun has either masculine or feminine gender. Urdu has loan words from Persian, Arabic, and Sanskrit. It is written from right to left, and the boundary between words is not always marked; for example, 'ادھرکیارکھاہے' (what lies there) is understandable even though it has no spaces between words.
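The word-boundary problem can be illustrated with a minimal Python sketch. It is not part of the proposed framework. The four-word segmentation of the example phrase is our own assumption, added only for contrast, and `whitespace_tokens` is a hypothetical helper name. The sketch shows that a whitespace tokenizer, the usual baseline for space-delimited scripts, treats the unspaced phrase as a single token.

```python
# Illustrative sketch only; not the tokenizer used in the paper.

# Assumption: a manual segmentation of the example phrase into four words.
words = ["ادھر", "کیا", "رکھا", "ہے"]

# The phrase as it may appear in running text, with no spaces between words.
unsegmented = "".join(words)

# The same phrase with explicit spaces between words.
segmented = " ".join(words)


def whitespace_tokens(text):
    """Split text on whitespace (hypothetical helper for this sketch)."""
    return text.split()


tokens_unsegmented = whitespace_tokens(unsegmented)
tokens_segmented = whitespace_tokens(segmented)

# The unspaced phrase comes back as one token; the spaced one as four.
print(len(tokens_unsegmented), len(tokens_segmented))
```

A whitespace split therefore cannot recover the individual words from the unspaced form, which is one reason Urdu text usually needs a dedicated word segmentation step before word vectors are looked up.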
