Abstract

Sentiment analysis has become an active field in both research and industry. The term sentiment refers to a person's emotions or thoughts about a particular issue, and sentiment analysis is considered a direct application of opinion mining. The large number of tweets posted daily makes Twitter a rich source of textual data and one of the most important data volumes; this data serves different purposes, such as business, industrial, or social ones, depending on the data requirements and the processing needed. The volume of data is very large and grows every second. This is known as big data, and it requires special processing techniques and high computational power to perform the required mining tasks. In this work, we perform sentiment analysis with the PySpark framework, the Python interface for Apache Spark, an open-source distributed processing platform that uses a distributed memory abstraction. The goal of using PySpark is to run applications in parallel on a distributed cluster (multiple nodes). Compared with other approaches, our proposed approach achieves better classification results with the Naïve Bayes, Logistic Regression, and Decision Tree classification algorithms. Finally, we evaluate the performance of Apache Spark with respect to its scalability.

Keywords: Python, PySpark, Machine learning, Tweepy, tokenization.
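
To make the classification step concrete, the following is a minimal single-machine sketch of tokenization followed by a multinomial Naïve Bayes classifier with add-one smoothing, written in plain Python. It is not the PySpark pipeline used in this work, and it runs no distributed computation; the function names (`tokenize`, `train_naive_bayes`, `predict`) and the four training tweets are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a tweet and split it into word tokens (hypothetical helper)."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(examples):
    """Count tweets per label and token occurrences per label."""
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for token in tokenize(text):
            token_counts[label][token] += 1
            vocabulary.add(token)
    return label_counts, token_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-posterior for the given text."""
    label_counts, token_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in label_counts.items():
        # Log prior: fraction of training tweets carrying this label.
        score = math.log(doc_count / total_docs)
        total_tokens = sum(token_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            smoothed = token_counts[label][token] + 1
            score += math.log(smoothed / (total_tokens + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set; real input would be labeled tweets.
TRAIN = [
    ("I love this phone", "positive"),
    ("what a great day", "positive"),
    ("I hate waiting in line", "negative"),
    ("this is a terrible service", "negative"),
]

model = train_naive_bayes(TRAIN)
print(predict(model, "I love this great day"))
print(predict(model, "terrible service, I hate it"))
```

In a Spark setting the same two stages (tokenizing text, then fitting a classifier on token counts) would be expressed as distributed transformations over the tweet collection rather than as Python loops.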
