Abstract
Online social media platforms let people around the world express their thoughts and opinions, communicate with one another, and stay informed about current events. Twitter is one of the best-known and most widely used of these platforms, and a large number of tweets covering a wide range of subjects are posted to it every day. Machine learning algorithms can be used to extract features and identify trends in this data, but extracting useful information from the continuous stream that Twitter produces requires tools and strategies designed for large-scale data. In this paper, we focus on hashtag identification and on determining which industry holds the highest share of voice. We collect live data from Twitter using Apache Spark and then classify each tweet using the machine learning techniques provided by the Apache Spark machine learning library. A convolutional neural network (CNN) and logistic regression (LR) are used to test the model. The CNN outperformed LR, achieving an average accuracy of approximately 95% and an F1 score of 0.60. Both the accuracy and the F1 score stand at 0.59. The results show that real-time tweets can be evaluated considerably faster with Apache Spark than in a conventional execution environment.
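As a rough illustration of the two quantities the abstract names, hashtag identification and share of voice, the following plain-Python sketch extracts hashtags from tweet text and computes each industry's fraction of all industry-tagged hashtag mentions. It is not the paper's Spark pipeline: the `HASHTAG_TO_INDUSTRY` lookup, the sample tweets, and the definition of share of voice as a simple mention ratio are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical hashtag-to-industry lookup; the paper's actual mapping is not given.
HASHTAG_TO_INDUSTRY = {
    "iphone": "technology",
    "ai": "technology",
    "nba": "sports",
    "worldcup": "sports",
    "bitcoin": "finance",
}

# A hashtag is taken to be '#' followed by one or more word characters.
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the lowercased hashtags in a tweet, without the leading '#'."""
    return [tag.lower() for tag in HASHTAG_PATTERN.findall(text)]


def share_of_voice(tweets):
    """Map each industry to its fraction of all industry-tagged hashtag mentions."""
    counts = Counter()
    for text in tweets:
        for tag in extract_hashtags(text):
            industry = HASHTAG_TO_INDUSTRY.get(tag)
            if industry is not None:
                counts[industry] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {industry: n / total for industry, n in counts.items()}


# Made-up sample tweets, for illustration only.
sample_tweets = [
    "Loving the new #iPhone camera #AI",
    "What a game last night #NBA",
    "#Bitcoin is moving again",
    "#AI is everywhere these days",
]

shares = share_of_voice(sample_tweets)
top_industry = max(shares, key=shares.get)
print(shares, top_industry)
```

In a streaming setting, the same per-tweet extraction and counting would be applied to each incoming batch rather than to an in-memory list.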