The COVID-19 pandemic has sparked widespread health-related discussion on social media platforms such as Twitter (now named ‘X’). However, the lack of labeled Twitter data makes theme-based classification and tweet aggregation difficult. To address this gap, we developed a machine learning-based web application that automatically classifies COVID-19 discourse into five categories: health risks, prevention, symptoms, transmission, and treatment. We collected 6,667 COVID-19-related tweets using the Twitter API, labeled them, and applied several feature extraction methods. We then compared the classification performance of seven classical machine learning algorithms (Decision Tree, Random Forest, Stochastic Gradient Descent, AdaBoost, K-Nearest Neighbor, Logistic Regression, and Linear SVC) and four deep learning techniques (LSTM, CNN, RNN, and BERT). Our results show that the CNN achieved the highest precision (90.41%), recall (90.4%), F1 score (90.4%), and accuracy (90.4%) overall. Among the classical machine learning approaches, Linear SVC achieved the highest precision (85.71%), recall (86.94%), and F1 score (86.13%). Our study advances the field of health-related data analysis and classification, and offers a publicly accessible web-based tool for public health researchers and practitioners. This tool has the potential to help address public health challenges and raise awareness during pandemics. The dataset and application are accessible at https://github.com/Bishal16/COVID19-Health-Related-Data-Classification-Website.
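As an illustration of how the four reported metrics can be computed for a five-class problem, the following is a minimal sketch in plain Python. It is not the authors' evaluation code: the function names (`per_class_counts`, `evaluate`), the toy labels, and the use of support-weighted averaging are all assumptions, since the abstract does not state which averaging scheme was used.

```python
from collections import Counter

# The five themes named in the abstract.
LABELS = ["health risks", "prevention", "symptoms", "transmission", "treatment"]


def per_class_counts(y_true, y_pred, label):
    """Return (true positives, false positives, false negatives) for one label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    return tp, fp, fn


def evaluate(y_true, y_pred, labels=LABELS):
    """Compute accuracy and support-weighted precision, recall, and F1.

    Assumption: weighted averaging (each class weighted by its share of
    the true labels). The abstract does not specify the scheme.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    precision = recall = f1 = 0.0

    for label in labels:
        tp, fp, fn = per_class_counts(y_true, y_pred, label)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = support[label] / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / total
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


if __name__ == "__main__":
    # Hypothetical gold labels and model predictions for six tweets.
    gold = ["symptoms", "symptoms", "prevention",
            "treatment", "transmission", "health risks"]
    pred = ["symptoms", "prevention", "prevention",
            "treatment", "transmission", "health risks"]
    for name, value in evaluate(gold, pred).items():
        print(f"{name}: {value:.4f}")
```

Running the same function on each model's predictions over a shared held-out set would give directly comparable scores. With weighted averaging, recall equals accuracy by construction, so a different averaging scheme (for example macro) may be preferable when the classes are imbalanced.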