Twitter has emerged as a rich source of real-time data, providing valuable insights across various domains such as healthcare and business. Sentiment analysis is crucial in understanding public reactions and sentiments expressed on Twitter, empowering organizations to make informed decisions. However, efficiently analyzing sentiment from social media data, presents a challenge for real-time streaming. Conventional Extract-Transform-Load (ETL) methods, inadequate for this challenge, limit their applicability in processing vast volumes of Twitter data. Big data technologies like Kafka, Spark, Hadoop HDFS, Hive, and HBase have become indispensable in addressing this challenge. To address this, we propose a stream ETL framework for Twitter-based sentiment analysis, leveraging Kafka, Spark, Cassandra, HBase, Hive, and HDFS. Our framework enables data stream processing, bias detection and correction, sentiment-based analysis, and visualization of tweets’ geospatial distribution. We present a set of use case studies to illustrate the applicability of the proposed framework comprises of sentiment classification, and bias detection and correction capability. We also present a comparative study to demonstrate the performance of data streaming processing, analysis, and visualization that has been implemented using multiple big data technologies under different parameter settings. Experimental results demonstrate the framework scalability and trade-off factors of the data stream processing pipeline in the execution of big data processing tasks.
Read full abstract