Abstract

The public is increasingly using social media platforms such as Twitter and Facebook to express their views on a variety of topics. As a result, social media has emerged as the most effective and largest open source for obtaining public opinion. Single node computational methods are inefficient for sentiment analysis on such large datasets. Supercomputers or parallel or distributed processing are two options for dealing with such large amounts of data. Most parallel programming frameworks, such as MPI (Message Processing Interface), are difficult to use and scale in environments where supercomputers are expensive. Using the Apache Spark Parallel Model, this proposed work presents a scalable system for sentiment analysis on Twitter. A Spark-based Naive Bayes training technique is suggested for this purpose; unlike prior research, this algorithm does not need any disk access. Millions of tweets have been classified using the trained model. Experiments with various-sized clusters reveal that the suggested strategy is extremely scalable and cost-effective for larger data sets. It is nearly 12 times quicker than the Map Reduce-based model and nearly 21 times faster than the Naive Bayes Classifier in Apache Mahout. To evaluate the framework’s scalability, we gathered a large training corpus from Twitter. The accuracy of the classifier trained with this new dataset was more than 80%.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call