The rapid growth of technology has enabled the easy and extensive dissemination of information on many topics, including business, marketing, news, and views on geopolitical situations. Although the number of studies on opinion mining is increasing quickly, most of them focus on resource-rich languages. Resource-poor languages such as Roman Urdu have long been neglected, despite their vast research potential and the fact that they represent almost 500 million people. Research on Roman Urdu has so far relied on machine learning (ML) methods because no large, standard corpus is available. Owing to these gaps in Roman Urdu sentiment analysis, the few publicly available corpora are neither large enough nor of sufficient quality to yield promising results with deep learning (DL) methods. The key contributions of this work are the enhancement of an existing Roman Urdu corpus and the application of a fine-tuned hybrid Convolutional Neural Network-Bidirectional LSTM (CNN-BiLSTM) model to the enhanced corpus. To enhance the corpus, reviews are collected from multiple online sources covering domains such as politics, sports, entertainment, and food. The enhanced corpus is annotated manually by two annotators, A and B, following guidelines given by researchers, and the annotation is statistically validated by computing Cohen's Kappa score, which indicates moderate agreement. Reviews on which A and B disagree are resolved by a third annotator, C. Finally, binary and multi-class experiments are performed using hybrid DL methods as well as ML-based models. The experiments show that the hybrid CNN-BiLSTM model outperforms the existing Recurrent Convolutional Neural Network (RCNN), RNN, LSTM, SVM, CRF, and rule-based models, achieving an accuracy of 0.774 for binary classification and 0.721 for multi-class classification on the enhanced Roman Urdu corpus.
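The annotation check described above can be illustrated with a minimal Python sketch. It computes Cohen's Kappa for two annotators and lists the items that would go to a third annotator. The function name `cohens_kappa` and the ten example labels are hypothetical, chosen only for illustration; they are not the authors' code or data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies,
    # summed over all labels.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        count_a[label] * count_b[label]
        for label in set(count_a) | set(count_b)
    ) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from annotators A and B (not taken from the corpus).
annotator_a = ["pos", "pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg"]
annotator_b = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos", "pos"]

kappa = cohens_kappa(annotator_a, annotator_b)

# Indices where A and B disagree; these would be passed to annotator C.
conflicts = [
    i for i, (a, b) in enumerate(zip(annotator_a, annotator_b)) if a != b
]

print(f"kappa = {kappa:.3f}, conflicts = {conflicts}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is preferred over a simple percentage when class frequencies are uneven.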