Sentiment Analysis on Twitter Data using Apache Spark Framework

Hossam Elzayady, Gouda I Salama, Khaled M Badran

doi:10.1109/icces.2018.8639195

Abstract

Sentiment analysis has become an active field in both research and industry. The term sentiment refers to a person's feelings or thoughts about a particular issue, and sentiment analysis is considered a direct application of opinion mining. The huge number of tweets posted daily makes Twitter a rich source of textual data and one of the most important bodies of data available; this data can serve business, industrial, or social purposes, depending on the requirements and the processing applied. The amount of data is massive and grows every second. Such big data requires special processing techniques and high computational power to perform the required mining tasks. In this work, we perform sentiment analysis with the help of the Apache Spark framework, an open-source distributed data processing platform built on a distributed memory abstraction. We use Apache Spark’s machine learning library (MLlib) to handle very large amounts of data efficiently. We recommend several preprocessing and machine learning text feature extraction steps to obtain better results in sentiment classification. The effectiveness of the proposed approach is demonstrated against other approaches: it achieves better classification results with the Naive Bayes, Logistic Regression, and Decision Tree classification algorithms. Finally, we evaluate the performance of Apache Spark with respect to its scalability.

Full Text