Abstract
Label noise in the training set can degrade the performance of a classification model. Sentiment classification is also affected, because customer reviews can be mislabeled: some customers give a product or service a rating score that is inconsistent with the content of their review. If business owners look only at the overall rating picture, which includes these mislabeled reviews, they may make erroneous business decisions. This issue is the main challenge addressed in this study. We assume that if customer reviews with noisy labels in the training data are validated and corrected before the learning process, the training set can produce a predictive model that gives better results for sentiment analysis or classification. We therefore propose a mechanism, called the polarity label analyzer, to improve the quality of a training set with noisy labels before the learning process. The polarity label analyzer assigns a polarity class to each sentence in a customer review, and the polarity class of the review as a whole is then decided by voting. In our experiment, datasets were downloaded from TripAdvisor, and two linguistic experts assigned the correct labels to the customer reviews as the ground truth. Sentiment classifiers were developed using the k-NN, Logistic Regression, XGBoost, Linear SVM, and CNN algorithms. Comparing the sentiment classifiers trained without training set improvement against those trained with it, our proposed method improved the average F1 and accuracy scores by 20.59%.
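The sentence-level labeling and voting step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the polarity of each sentence is determined, so the tiny word lexicon below is a purely hypothetical stand-in, as are the function names, the tie-breaking rule, and the choice to keep the original label when the vote is undecided.

```python
import re
from collections import Counter

# Hypothetical toy lexicon: a stand-in for whatever sentence-level
# polarity scorer is actually used (not specified in the abstract).
POSITIVE = {"good", "great", "clean", "friendly", "excellent", "comfortable"}
NEGATIVE = {"bad", "dirty", "rude", "noisy", "terrible", "broken"}


def sentence_polarity(sentence):
    """Label one sentence by counting lexicon hits."""
    words = re.findall(r"[a-z']+", sentence.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def review_polarity(review):
    """Label a whole review by majority vote over its sentences."""
    sentences = [s for s in re.split(r"[.!?]+", review) if s.strip()]
    votes = Counter(sentence_polarity(s) for s in sentences)
    votes.pop("neutral", None)  # assumption: neutral sentences do not vote
    if not votes:
        return "neutral"
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "neutral"  # assumption: a tied vote is treated as undecided
    return ranked[0][0]


def relabel(reviews):
    """Replace each (text, label) pair's label with the voted polarity.

    Assumption: when the vote is undecided, the original label is kept.
    """
    cleaned = []
    for text, label in reviews:
        voted = review_polarity(text)
        cleaned.append((text, label if voted == "neutral" else voted))
    return cleaned
```

For example, `relabel([("The room was dirty. Staff were rude.", "positive")])` replaces the inconsistent `"positive"` label with `"negative"`, so the corrected pair can be used for training instead of the noisy one.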