The rise of social media platforms has changed how consumers interact with retailers and express opinions about products and services. Online retailers in particular need to monitor customer sentiment in real time to make informed decisions about their offerings and improve customer satisfaction. However, efficiently analysing large volumes of unstructured social media text in real time remains a significant challenge. This research aimed to develop a scalable, real-time sentiment analysis system tailored for online retailers, using Reddit as the data source. The system comprises three main components: a data extraction and streaming pipeline, a sentiment analysis model, and a web application with real-time analytics. For data extraction, a job-queue-based system was implemented using Node.js, BullMQ, and Redis to create and manage campaigns for streaming data from Reddit. The data was streamed through Kafka, a distributed streaming platform, to enable efficient real-time processing. The sentiment analysis model was built on a Naive Bayes classifier, chosen after experiments with other machine learning and deep learning techniques. The model's performance was evaluated on test data using standard metrics suited to online retail sentiment analysis. It achieved an accuracy of 0.6737, meaning that approximately 67.37 per cent of the sentiments in the test data were classified correctly, and an F1 score of 0.7894. The Area Under the Curve (AUC) on the test data was 0.5468, a value that, while acceptable, suggests room for further refinement in the model's ability to discriminate between classes. Integration of the Data Version Control (DVC) system provided a mechanism for fine-tuning the model to the specific data requirements of individual tenants.
Taken together, these results validate the feasibility of employing a Naive Bayes classifier for real-time sentiment analysis in the retail context and provide a baseline for future research aimed at improving both the accuracy and the efficiency of sentiment classification. The project’s evaluation covered the performance of the sentiment analysis model, the efficiency of the Kafka streaming and real-time Spark pre-processing pipeline, and the backend infrastructure, including the job queuing system and the WebSocket implementation. Evaluation techniques included graphical analysis and comparison with the literature. In conclusion, this project demonstrated the feasibility of a scalable, real-time sentiment analysis system for online retailers using Reddit data. The system has the potential to help retailers better understand customer opinions and make data-driven decisions for their businesses. Future work could include exploring alternative data sources, experimenting with more advanced sentiment analysis techniques, and enhancing the web application’s user interface and analytics capabilities.
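For readers less familiar with the three metrics reported above, the following is a minimal sketch of how accuracy, F1, and AUC can be computed for a binary sentiment classifier. It is not the study's evaluation code. The labels and scores are hypothetical toy values, the function names are illustrative, and AUC is assumed here to mean the area under the ROC curve, computed through its pairwise-ranking interpretation.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true: Sequence[int], scores: Sequence[float], positive: int = 1) -> float:
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win, so a model that cannot rank at all scores 0.5.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = positive sentiment, 0 = negative sentiment.
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.65, 0.3]  # predicted P(positive)
y_pred = [1 if s >= 0.5 else 0 for s in scores]      # assumed 0.5 threshold

print(f"accuracy={accuracy(y_true, y_pred):.4f} "
      f"f1={f1_score(y_true, y_pred):.4f} "
      f"auc={roc_auc(y_true, scores):.4f}")
# prints: accuracy=0.7500 f1=0.8333 auc=0.7500
```

Under this definition an AUC of 0.5 corresponds to chance-level ranking, which is a useful reference point when reading a reported value. F1 can also exceed accuracy when the positive class dominates the test set, as it does in this toy example.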