Abstract

The emergence of the World Wide Web has facilitated the growth of user-generated texts in less-resourced languages. Sentiment analysis of these texts may serve as a key performance indicator of the quality of services delivered by companies and government institutions. User-generated texts thus give managers and policy-makers an opportunity to improve performance and increase customer satisfaction. Because of this potential, sentiment analysis has been widely researched in the past few years. A plethora of approaches and tools have been developed, albeit predominantly for well-resourced languages such as English. Resources for less-resourced languages such as Amharic, the focus of this paper, are much less developed. As a result, sentiment analysis for these languages requires cost-effective approaches and massive amounts of annotated training data, calling for different approaches to be applied. This research investigates the performance of a combination of heterogeneous machine learning algorithms (the base learners SVM, RF, and NB). These models are fused by a meta-learner (in this case, logistic regression) for Amharic sentiment classification. An annotated corpus is provided for evaluation of the classification framework. The proposed stacked approach, applying SMOTE to TF-IDF character (1,7)-gram features, achieved an accuracy of 90%. Overall, the meta-learner (i.e., the stacked ensemble) improved on the base learners when using TF-IDF character n-gram features.
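To make the "character (1,7)-gram" feature concrete, the following is a minimal sketch of how such n-grams can be enumerated from a short Amharic string before any TF-IDF weighting is applied. The function name `char_ngrams` and the sample phrase are illustrative assumptions; the abstract does not say whether the authors extract n-grams across or within word boundaries, so this sketch simply slides over the raw string.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=7):
    """Return all character n-grams of length n_min..n_max (hypothetical helper)."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)  # empty when n exceeds len(text)
    ]


# Illustrative input only: a short Amharic phrase of five characters.
sample = "ጥሩ ነው"
ngram_counts = Counter(char_ngrams(sample))
```

Strings shorter than seven characters simply yield no n-grams at the longer lengths, so the upper bound acts as a cap rather than a requirement.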

Highlights

  • With the emergence of World Wide Web (WWW) technology, the number of user‐generated texts is increasing

  • The results show the performance of the proposed model for Amharic sentiment classification

  • The main purpose of this study is to investigate a stacking strategy for Amharic sentiment classification, in which a meta‐learner combines three base learners (Support Vector Machine (SVM), Naive Bayesian (NB), and Random Forest (RF)) trained on Term Frequency‐Inverse Document Frequency (TF‐IDF) character (1,7)‐gram feature sets, evaluated with five‐fold cross‐validation (CV) across four datasets
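The stacking setup described in the highlights can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy English sentences and labels are made-up placeholders for an annotated Amharic corpus, `LinearSVC` and `MultinomialNB` are assumed variants of the SVM and NB base learners, the hyperparameters are arbitrary, and the SMOTE oversampling step is omitted (it could be added with the imbalanced-learn package).

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up placeholder data; a real run would load the annotated corpus.
texts = [
    "good service", "very helpful staff", "great product", "fast and friendly",
    "excellent support", "happy with the result",
    "bad service", "very rude staff", "poor product", "slow and unfriendly",
    "terrible support", "unhappy with the result",
]
labels = ["pos"] * 6 + ["neg"] * 6


def build_stack():
    """Build a TF-IDF character (1,7)-gram pipeline with a stacked classifier."""
    base_learners = [
        ("svm", LinearSVC()),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("nb", MultinomialNB()),
    ]
    stack = StackingClassifier(
        estimators=base_learners,
        final_estimator=LogisticRegression(),  # meta-learner
        cv=5,  # out-of-fold base predictions from five folds train the meta-learner
    )
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 7))
    return make_pipeline(vectorizer, stack)


model = build_stack().fit(texts, labels)
predictions = model.predict(["good and helpful", "slow and rude"])
```

Training the meta-learner on out-of-fold predictions rather than on predictions for data the base learners have already seen is what keeps the logistic regression from simply trusting whichever base learner overfits most.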


Summary

Introduction

With the emergence of World Wide Web (WWW) technology, the number of user‐generated texts is increasing. This is helping businesses/organizations to enhance their services and products, boosting their revenue and competitiveness by increasing consumer or client satisfaction. As people are using online reviews to promote products and receive feedback about their services/products from their clients anywhere in the world, the amount of opinionated data is increasing drastically on a daily basis. On social media platforms, people usually use different formats, such as texts, audio, video, graphics, and images, to express their feelings and opinions about an event/service/product. Textual data are the most relevant and accessible user‐generated content.

