Predicting Supervise Machine Learning Performances for Sentiment Analysis Using Contextual-Based Approaches

Azwa Abdul Aziz, Andrew Starkey

doi:10.1109/access.2019.2958702

Abstract

Sentiment Analysis (SA) focuses on mining opinions (identification and classification) from unstructured text data such as product reviews or microblogs. It is widely used for brand reviews, political campaigns, marketing analysis, and gathering customer feedback. One prominent approach to SA is supervised machine learning (SML), in which an algorithm learns a mathematical model from a training dataset with defined class labels. While the results are promising, especially for in-domain sentiment, there is no guarantee that the model performs equally well on real-time data, because new data are diverse. In addition, previous studies suggest that SML results degrade on cross-domain datasets because new features appear in different domains. So far, SA studies have emphasised improving sentiment results, whereas there is little discussion of how to detect performance degradation in a proposed model. Therefore, we provide a method known as Contextual Analysis (CA), a mechanism that constructs relationships between words and sources in a tree structure called the Hierarchical Knowledge Tree (HKT). Then, the Tree Similarity Index (TSI) and Tree Differences Index (TDI), formulas generated from the tree structure, are proposed to find similarities as well as changes between the training and actual datasets. Regression analysis of the datasets reveals a highly significant positive relationship between TSI and SML accuracy. As a result, the prediction model showed estimation errors within 2.75 to 3.94 and 2.30 to 3.51 for average absolute differences. Moreover, this method can also cluster sentiment words into positive and negative without any linguistic resources, while capturing changes in sentiment words when a new dataset is applied.

Highlights

Sentiment analysis (SA) can be described as a computational study to assess people’s attitudes, appraisals, and opinions about individuals, issues, entities, topics, events, and products as well as their attributes [1]
This paper focuses on supervised approaches, as the majority of practical machine learning work on textual analysis uses supervised learning for sentiment analysis
A novel approach known as Contextual Analysis (CA) is proposed to find the relationship between words and sources which can provide a mechanism to predict supervised machine learning (SML) model performance

Summary

INTRODUCTION

Sentiment analysis (SA) can be described as a computational study to assess people’s attitudes, appraisals, and opinions about individuals, issues, entities, topics, events, and products as well as their attributes [1]. The performance of ML models drops when they are applied to new datasets from cross-domain sources that contain different features (words) from the training data. The study looks at training and testing on data from different domains (e.g., training on the book and DVD domains, testing on hotel reviews), with results reported in the range of 50–75%, which is low compared to in-domain datasets. This result shows how the similarity of domains influences the end results of the model. In part five, conclusions are drawn and potential improvements are discussed.
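
The cross-domain feature mismatch described above can be illustrated with a minimal sketch. It is not the paper's TSI or TDI, whose formulas are not given in this summary. It only measures the share of test-set words that never occur in the training set. The function names (`vocabulary`, `unseen_ratio`) and the toy reviews are assumptions made for illustration.

```python
import re


def vocabulary(texts):
    """Return the set of lowercase word tokens found in a list of strings."""
    words = set()
    for text in texts:
        words.update(re.findall(r"[a-z']+", text.lower()))
    return words


def unseen_ratio(train_texts, test_texts):
    """Fraction of distinct test-set words that never occur in the training set."""
    train_vocab = vocabulary(train_texts)
    test_vocab = vocabulary(test_texts)
    if not test_vocab:
        return 0.0  # nothing to compare against
    return len(test_vocab - train_vocab) / len(test_vocab)


# Toy, made-up reviews (hypothetical; not drawn from the paper's datasets).
book_reviews = ["a gripping plot and great characters", "boring plot and weak ending"]
hotel_reviews = ["great location and clean room", "rude staff and dirty room"]

# In-domain test text reuses the training vocabulary; cross-domain text does not.
in_domain = unseen_ratio(book_reviews, ["great plot and weak characters"])
cross_domain = unseen_ratio(book_reviews, hotel_reviews)
print(f"in-domain unseen ratio:    {in_domain:.2f}")
print(f"cross-domain unseen ratio: {cross_domain:.2f}")
```

A higher unseen ratio means the classifier encounters more features it has no learned weights for, which is one plausible reason accuracy drops across domains. The paper's tree-based indices are designed to capture this kind of shift in a more structured way.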

RELATED WORKS

Findings

CONCLUSION AND FUTURE WORK