A spark‐based big data analysis framework for real‐time sentiment prediction on streaming data

Deniz Kılınç

doi:10.1002/spe.2724

Open Access

https://doi.org/10.1002/spe.2724

Journal: Software: Practice and Experience
Publication Date: Jun 27, 2019
Citations: 10
License type: publisher-specific, author manuscript

Affiliation: Manisa Celal Bayar University

Abstract

Many data sources produce large volumes of data, and the nature of Big Data calls for new distributed processing approaches to extract valuable information from it. Real‐time sentiment analysis is one of the most demanding research areas and requires powerful Big Data analytics tools such as Spark. A prior literature survey has shown that, although there are many conventional sentiment analysis studies, only a few works perform sentiment analysis in real time. One major factor that affects the quality of real‐time sentiment analysis is the reliability of the generated data. More precisely, a valuable research question is whether the account owner who generates the sentiment is genuine. Since data generated by fake accounts may decrease the accuracy of the outcome, an intelligent service that can assess the source of the data is a key part of the analysis. In this context, we include a fake account detection service in the proposed framework. Both the sentiment analysis and the fake account detection systems are trained and tested using the Naïve Bayes model from Apache Spark's machine learning library. The developed system consists of four integrated software components: (i) a machine learning and streaming service for sentiment prediction, (ii) a Twitter streaming service to retrieve tweets, (iii) a Twitter fake account detection service to assess the owner of the retrieved tweet, and (iv) a real‐time reporting and dashboard component to visualize the results of the sentiment analysis. The sentiment classification performance of the system in offline and real‐time modes is 86.77% and 80.93%, respectively.
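The flow described above, in which a fake account check gates each tweet before sentiment prediction, can be illustrated with a minimal sketch. This is not the framework's implementation: it uses a small pure-Python multinomial Naïve Bayes with Laplace smoothing in place of Spark's machine learning library, a plain Python iterable in place of the Twitter stream, and invented training data. The class `NaiveBayes`, the function `process`, the profile tokens such as `default_avatar`, and all example tweets are assumptions made only for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Toy multinomial Naive Bayes with Laplace (add-one) smoothing.

    Stand-in for a library model; documents are whitespace-tokenized strings.
    """

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in doc.lower().split():
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: tweet text for sentiment, profile tokens for accounts.
sentiment_model = NaiveBayes().fit(
    ["great phone love it", "love this great camera",
     "terrible battery hate it", "hate this awful screen"],
    ["positive", "positive", "negative", "negative"],
)
account_model = NaiveBayes().fit(
    ["default_avatar no_bio few_followers",
     "default_avatar many_following few_followers",
     "custom_avatar has_bio many_followers",
     "custom_avatar has_bio few_followers"],
    ["fake", "fake", "genuine", "genuine"],
)


def process(stream):
    """Yield (text, sentiment) for tweets whose account is predicted genuine."""
    for tweet in stream:
        if account_model.predict(tweet["profile"]) == "fake":
            continue  # drop tweets from accounts flagged as fake
        yield tweet["text"], sentiment_model.predict(tweet["text"])


# A list stands in for the live tweet stream.
tweets = [
    {"profile": "custom_avatar has_bio many_followers", "text": "love the camera"},
    {"profile": "default_avatar no_bio few_followers", "text": "great phone"},
    {"profile": "custom_avatar has_bio few_followers", "text": "awful battery"},
]
results = list(process(tweets))
```

In this sketch the second tweet is dropped because its profile is classified as fake, so only the first and third tweets reach the sentiment model. Filtering before classification keeps tweets from flagged accounts out of the reported results, which is the motivation the abstract gives for the fake account detection service.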

Full Text