Exploratory data analysis on Reddit data: An efficient pipeline for classification of flairs

Reshma Shaji

doi:10.1109/bigmm52142.2021.00018

Abstract

The Internet has become a platform where people learn new things, share their opinions, and communicate with each other from anywhere. As technology advances, the number of Internet users grows, and with it the amount of data generated. Social networking sites such as Reddit, Facebook, and Twitter have gained global popularity as platforms through which people can create individual public profiles, interact with real friends, share their interests and opinions, and post messages on any topic. Each post is tagged for filtering purposes; on Reddit, these tags are called flairs. This paper presents a comparative data analysis that uses existing machine learning and natural language processing techniques to detect the flair of each Reddit post. The data were analyzed using different features, and a pipeline combining natural language processing techniques such as Count Vectorization and TF-IDF Transformation with machine learning techniques such as K-Nearest Neighbor (KNN), Decision Tree, Support Vector Machines (SVM), and Logistic Regression was used to study the data and classify the flairs.
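To make the kind of pipeline named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It chains count vectorization, a TF-IDF transformation, and a K-Nearest Neighbor vote using only the Python standard library. The example posts, the flair names, the whitespace tokenizer, the smoothed IDF variant, the use of cosine similarity, and every function name are assumptions introduced for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    return text.lower().split()

def count_vectorize(docs):
    # Count vectorization: one sparse term-count mapping per document.
    return [Counter(tokenize(doc)) for doc in docs]

def fit_idf(counts):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    n_docs = len(counts)
    doc_freq = Counter()
    for count in counts:
        doc_freq.update(count.keys())
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}

def tfidf_transform(count, idf):
    # Weight term counts by IDF, drop unseen terms, and L2-normalize.
    vec = {term: tf * idf[term] for term, tf in count.items() if term in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {term: w / norm for term, w in vec.items()} if norm else vec

def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(w * b.get(term, 0.0) for term, w in a.items())

def knn_predict(query_vec, train_vecs, labels, k=3):
    # Majority vote among the k most similar training posts.
    ranked = sorted(zip(train_vecs, labels),
                    key=lambda pair: cosine(query_vec, pair[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: invented posts and flairs, not taken from the paper.
train_posts = [
    "election results announced by the government",
    "parliament passes new government bill",
    "cricket team wins the final match",
    "football match ends in a draw",
]
train_flairs = ["Politics", "Politics", "Sports", "Sports"]

train_counts = count_vectorize(train_posts)
idf = fit_idf(train_counts)
train_vecs = [tfidf_transform(count, idf) for count in train_counts]

def predict_flair(post, k=3):
    # Run one new post through the same vectorize -> TF-IDF -> KNN steps.
    query_vec = tfidf_transform(count_vectorize([post])[0], idf)
    return knn_predict(query_vec, train_vecs, train_flairs, k=k)
```

Calling `predict_flair("government announces election dates")` on this toy data votes among the three nearest training posts. Swapping the final KNN step for a Decision Tree, SVM, or Logistic Regression classifier would leave the two vectorization stages unchanged, which is what makes a pipeline convenient for comparing models.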

Full Text