SENTIMENT ANALYSIS AND CLASSIFICATION OF AIR INDIA FLIGHT INCIDENT USING YOUTUBE COMMENTS

Abstract

The use of social media to exchange ideas and facts has increased exponentially with technological advancement. Video-sharing platforms such as YouTube have distinctive environments and architectures that people use for entertainment, education, and staying informed. YouTube is one of the most frequently used social media platforms, and users engage with it by viewing videos, sharing opinions in comments, and liking or disliking videos. An opinion is a viewpoint or judgement formed about something. Opinions can be collected and used to check knowledge, suggest new video ideas to the author, and analyze user behaviour. In this study, data concerning the recent ‘Air India Flight Urination Case’ were extracted from the free video-sharing platform YouTube to identify people’s opinions at national and international levels. Models are applied to approximately 10,000 comments about the incident to classify and investigate the sentiments. The investigation uses the TF-IDF and Bag of Words (BoW) text modelling techniques and observes that BoW performs better than TF-IDF. Moreover, Naive Bayes, Logistic Regression, Decision Tree, and Support Vector Machine classifiers, together with ensemble algorithms such as Random Forest, Gradient Boosting, and a Voting Classifier (combining Support Vector Machine, Decision Tree, Logistic Regression, and Random Forest) with soft and hard voting, were applied; the Support Vector Machine achieved the highest classification accuracy, 84%.
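The two text modelling techniques compared in the abstract can be illustrated with a minimal pure-Python sketch. The sample comments, the whitespace tokenizer, and the smoothed IDF formula are assumptions chosen for illustration; the abstract does not describe the study's actual preprocessing or weighting scheme.

```python
import math
from collections import Counter

# Toy comments invented for illustration (not taken from the study's data).
comments = [
    "shameful behaviour on the flight",
    "the airline handled the flight badly",
    "crew did a good job",
]

def tokenize(text):
    """Lowercase and split on whitespace; real pipelines usually do more."""
    return text.lower().split()

docs = [tokenize(c) for c in comments]
vocab = sorted({tok for doc in docs for tok in doc})

def bow_vector(doc):
    """Bag of Words: raw count of each vocabulary term in the document."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]

def idf(term):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + len(docs)) / (1 + df)) + 1

def tfidf_vector(doc):
    """TF-IDF: term count scaled by how rare the term is across documents."""
    counts = Counter(doc)
    return [counts[term] * idf(term) for term in vocab]

bow = [bow_vector(d) for d in docs]
tfidf = [tfidf_vector(d) for d in docs]
```

In this sketch BoW keeps raw counts, so a frequent word such as "the" weighs as much as any other term, while TF-IDF down-weights terms that appear in many comments. Either set of vectors could then be passed to a classifier.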

Similar Papers
  • Research Article
  • Cite Count: 1
  • 10.25126/jtiik.938663
Sentiment Analysis for Evaluating the Brand Reputation of Motor XYZ Regarding the Motorcycle Frame Issue on Twitter Using a Machine Learning Approach
  • Jul 31, 2024
  • Jurnal Teknologi Informasi dan Ilmu Komputer
  • Ferdian Maulana Akbar + 5 more

Motor XYZ introduced an innovative motorcycle frame in 2019. Around August 2023, rumors began circulating on social media that these frames were rusting, corroding, and cracking. This caused public concern and potentially harmed the XYZ brand's reputation. This study aims to evaluate public opinion on Twitter regarding Motor XYZ, particularly the discussion of the motorcycle frame issue. Data were collected by crawling tweets posted between August and November 2023. The study applies sentiment analysis with word clouds, trend and distribution analysis, and a comparison of five machine learning algorithms: Naïve Bayes, Decision Tree, Support Vector Machine, Logistic Regression, and Random Forest. The goal is to identify the best-performing algorithm for categorizing tweets and to provide recommendations to Motor XYZ about its brand reputation concerning the frame issue. The results show that the Random Forest model, after hyperparameter tuning, performed best, with an F1 score of 0.765. The study recommends increasing awareness of free frame inspections, which were shown to have a positive impact on public sentiment on Twitter. The study does not cover deployment of the machine learning model or dashboard creation, nor does it address brand reputation or sentiment analysis on other social media platforms such as TikTok or Instagram.

  • Research Article
  • 10.1002/cpe.70325
Soil Nutrient Analysis and Yield Prediction With Neuro‐ML Ensemble Model Using IoT‐WSN Approach: In Context to India's Agricultural Sector
  • Oct 21, 2025
  • Concurrency and Computation: Practice and Experience
  • Sandeep Bhatia + 2 more

Agriculture is the backbone of the Indian economy and of people's lives. On agricultural land, soil is the element on which production quality and efficiency depend most. Phosphorus (P), nitrogen (N), potassium (K), and the potential of hydrogen (pH) are the key soil parameters. An efficient crop recommender and prediction system is needed to optimize agricultural practices given the escalating demand for food. Traditional, time‐consuming manual farming should be replaced with a smart agriculture framework that integrates technologies such as the Internet of Things (IoT), Wireless Sensor Networks (WSN), and Machine Learning (ML). This paper proposes an IoT‐WSN‐driven crop management system with a Neuro‐ML Ensemble Model, utilizing a LoRaWAN gateway, that can be deployed in agricultural fields to collect real‐time soil parameters. For soil nutrient analysis, the authors used various ML algorithms, namely Naive Bayes (NB), Logistic Regression (LR), K‐Nearest Neighbor (KNN), Decision Tree (DT), Random Forest (RF), AdaBoost (AB), Gradient Boosting (GB), and Support Vector Machine (SVM), and recommend a suitable ML algorithm for the crop recommender system. For crop yield prediction, the authors developed and recommend a customized GB algorithm with an accuracy of 98.80%, and for the fertilizer recommendation system, they suggest CNN‐BiGRU, which outperforms other approaches such as BiGRU and CNN with an average accuracy of 92.48%. The work is presented in the context of the Indian agriculture sector and compares ML algorithms on state‐of‐the‐art datasets, available on some Indian government websites and used by other authors, with a dataset the authors collected from hardware using a Raspberry Pi. For crop recommendation and forecasting, the Neuro‐ML Ensemble model combines neural networks (NN) with the ML models.
This research aims to help farmers choose crops suited to their environment and situation by analyzing and predicting which crops fit the parameters that drive crop growth, such as soil nutrients, soil moisture, soil pH, and rainfall. The authors report the accuracy of the ML models used in the framework. For NB, LR, KNN, SVM, DT, and RF, they obtained accuracies of 99.54%, 96.36%, 95.90%, 96.81%, 98.86%, and 99.31%, respectively, using the open‐access Kaggle dataset. On the dataset collected by the authors, they obtained accuracies of 94.54%, 91.36%, 92.72%, 92.73%, 86.36%, and 94.54% for NB, LR, KNN, SVM, DT, and RF, respectively. The authors found that Naive Bayes (NB) outperforms the other machine learning algorithms, such as KNN, SVM, LR, Decision Tree, RF, and AB, and is the algorithm best suited for crop yield.

  • Research Article
  • 10.32628/cseit24105103
Sentiment Analysis of Political Parties on social media: A Machine Learning and Lexicon-Based Approach
  • Nov 1, 2024
  • International Journal of Scientific Research in Computer Science, Engineering and Information Technology
  • Mr Swapnil P Goje + 1 more

Social media platforms like Facebook, Twitter, Instagram, and YouTube have become central to communication and entertainment, with users sharing opinions on various topics. These opinions, often categorized as positive, negative, or neutral sentiments, provide valuable data for sentiment analysis. Our research analyzed political YouTube comments related to India’s Bharatiya Janata Party (BJP) and Indian National Congress (INC) using a combination of the AFINN lexicon and machine learning techniques. We applied feature representation methods such as Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF), alongside five machine learning algorithms: Multinomial Naïve Bayes, Logistic Regression, Random Forest, Support Vector Machine (SVM), and K-nearest neighbor (K-NN). We aimed to determine the most efficient sentiment analysis approach by comparing the performance of these models using standard evaluation metrics. For the BJP dataset, Logistic Regression performed best with BoW, while SVM was most effective with TF-IDF. Similarly, for the INC dataset, Random Forest excelled with BoW, and SVM outperformed others with TF-IDF. The AFINN lexicon showed poor performance across both datasets, and K-NN consistently achieved lower accuracy. Our findings suggest that SVM and Random Forest are more suitable for political sentiment analysis.

  • Research Article
  • Cite Count: 70
  • 10.1007/s11657-020-00802-8
Application of machine learning approaches for osteoporosis risk prediction in postmenopausal women.
  • Oct 23, 2020
  • Archives of Osteoporosis
  • Jae-Geum Shim + 6 more

Osteoporosis is a silent disease until it results in fragility fractures. However, early diagnosis of osteoporosis provides an opportunity to detect and prevent fractures. We aimed to develop machine learning approaches to achieve high predictive ability for osteoporosis risk that could help primary care providers identify which women are at increased risk of osteoporosis and should therefore undergo further testing with bone densitometry. We included all postmenopausal Korean women from the Korea National Health and Nutrition Examination Surveys (KNHANES V-1, V-2) conducted in 2010 and 2011. Machine learning models using methods such as the k-nearest neighbors (KNN), decision tree (DT), random forest (RF), gradient boosting machine (GBM), support vector machine (SVM), artificial neural networks (ANN), and logistic regression (LR) were developed to predict osteoporosis risk. We analyzed the effect of applying the machine learning algorithms to the raw data versus feature-selected data, in which only the statistically significant variables were included as model inputs. The accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC) were used to evaluate performance among the seven models. A total of 1792 patients were included in this study, of which 613 had osteoporosis. The raw data consisted of 19 variables and achieved performances (in terms of AUROCs) of 0.712, 0.684, 0.727, 0.652, 0.724, 0.741, and 0.726 for KNN, DT, RF, GBM, SVM, ANN, and LR with fivefold cross-validation, respectively. The feature-selected data consisted of nine variables and achieved performances (in terms of AUROCs) of 0.713, 0.685, 0.734, 0.728, 0.728, 0.743, and 0.727 for KNN, DT, RF, GBM, SVM, ANN, and LR with fivefold cross-validation, respectively. In this study, we developed and compared seven machine learning models to accurately predict osteoporosis risk.
The ANN model performed best when compared to the other models, having the highest AUROC value. Applying the ANN model in the clinical environment could help primary care providers stratify osteoporosis patients and improve the prevention, detection, and early treatment of osteoporosis.
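The evaluation metrics named in this abstract (sensitivity, specificity, and AUROC) can be made concrete with a small pure-Python sketch. The labels, scores, and the 0.5 decision threshold are invented for illustration and are not the study's data; AUROC is computed here through its rank interpretation, the probability that a randomly chosen positive case is scored above a randomly chosen negative one.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auroc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical risk scores for eight patients (1 = osteoporosis).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

sens, spec = sensitivity_specificity(y_true, y_pred)
area = auroc(y_true, scores)
```

Sensitivity and specificity depend on the chosen threshold, whereas AUROC summarizes ranking quality across all thresholds, which is one reason it is often used to compare models.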

  • Research Article
  • 10.1145/3695251
Empowering Digital Civility with an NLP Approach for Detecting 𝕏 (Formerly Known as Twitter) Cyberbullying through Boosted Ensembles
  • Nov 23, 2024
  • ACM Transactions on Asian and Low-Resource Language Information Processing
  • Senthil Prabakaran + 2 more

As the number of social networking sites grows, so do cyber dangers. Cyberbullying is harmful behavior that uses technology to intimidate, harass, or harm someone, often on social media platforms like 𝕏 (formerly known as Twitter). Machine learning is well suited to cyberbullying detection on 𝕏 because it can process large amounts of data, identify patterns of offensive behavior, and automate the detection process for a corpus of tweets. To identify cyber threats using a trained model, the boosted ensemble (BE) technique is assessed against various machine learning algorithms such as the convolutional neural network (CNN), long short-term memory (LSTM), naive Bayes (NB), decision tree (DT), support vector machine (SVM), bidirectional LSTM (BILSTM), recurrent neural network LSTM (RNN-LSTM), multi-modal cyberbullying detection (MMCD), and random forest (RF). These classifiers are trained on the vectorized data to classify the tweets and identify cyberbullying threats. The proposed framework can detect cyberbullying cases in tweets precisely. The significance of the work lies in detecting and mitigating cyber threats in real time, and it enhances the safety and well-being of social media users by reducing instances of cyberbullying and other cyber threats. The comparative analysis uses accuracy, precision, recall, and F1-score, and the results show that the BE technique outperforms the other compared algorithms in overall performance. The accuracy rates of CNN, LSTM, NB, DT, SVM, RF, BILSTM, and BE are 92.5%, 93.5%, 84.6%, 88%, 89.3%, 92%, 93.75%, and 96%, respectively; the precision rates of CNN, LSTM, NB, DT, SVM, RF, RNN-LSTM, and BE are 90.2%, 91.3%, 88%, 85%, 86%, 91.6%, 92.1%, and 94%; the recall rates of CNN, LSTM, NB, DT, SVM, RF, BILSTM, and BE are 89.8%, 90.7%, 90%, 82%, 88.67%, 89%, 91.04%, and 93.7%; and the F1-scores of CNN, LSTM, NB, DT, SVM, RF, MMCD, and BE are 90.6%, 91.8%, 85%, 84.56%, 87.2%, 90%, 84.6%, and 94.89%.

  • Research Article
  • Cite Count: 3
  • 10.32629/jai.v6i2.623
Experiences of sexual minorities on social media: A study of sentiment analysis and machine learning approaches
  • Aug 4, 2023
  • Journal of Autonomous Intelligence
  • Peter Appiahene + 3 more

Nowadays, social media has become a forum for people to express their views on issues such as sexual orientation, legislation, and taxes. Sexual orientation refers to whom a person is attracted to and wishes to be involved with. Around the world, many people are regarded as having different sexual orientations. People categorized as lesbian, gay, bisexual, transgender, queer, and many more (LGBTQ+) have many sexual orientations. Because of the public stigmatization of LGBTQ+ persons, many turn to social media to express themselves, sometimes anonymously. The present study aims to use natural language processing (NLP) and machine learning (ML) approaches to assess the experiences of LGBTQ+ persons. The study used lexicon-based sentiment analysis (SA) and trained six distinct machine learning classifiers: logistic regression (LR), support vector machine (SVM), naïve Bayes (NB), decision tree (DT), random forest (RF), and gradient boosting (GB). According to the SA results, individuals are positive about LGBTQ concerns; yet, according to the negative sentiment ratings, prejudice and harsh statements against LGBTQ people persist in many regions where they live. Furthermore, the LR, SVM, NB, DT, RF, and GB classifiers attained accuracy values of 97%, 96%, 88%, 100%, 92%, and 91%, respectively. The performance assessment metrics used also yielded high recall and precision values. This study will help the government, non-governmental organizations, and rights advocacy groups make educated decisions about LGBTQ+ concerns in order to ensure a sustainable future and peaceful coexistence.

  • Research Article
  • Cite Count: 14
  • 10.1016/j.neuri.2024.100169
An ensemble machine learning-based approach to predict cervical cancer using hybrid feature selection
  • Aug 10, 2024
  • Neuroscience Informatics
  • Khandaker Mohammad Mohi Uddin + 4 more


  • Research Article
  • 10.1016/j.wneu.2025.124322
Kissing Spine and Other Imaging Predictors of Postoperative Cement Displacement Following Percutaneous Kyphoplasty: A Machine Learning Approach.
  • Jul 1, 2025
  • World neurosurgery
  • Yinglun Zhao + 7 more


  • Research Article
  • Cite Count: 35
  • 10.1002/mp.14699
Detecting MLC modeling errors using radiomics-based machine learning in patient-specific QA with an EPID for intensity-modulated radiation therapy.
  • Jan 27, 2021
  • Medical Physics
  • Madoka Sakai + 13 more

We sought to develop machine learning models to detect multileaf collimator (MLC) modeling errors using radiomic features of fluence maps measured in patient-specific quality assurance (QA) for intensity-modulated radiation therapy (IMRT) with an electronic portal imaging device (EPID). Fluence maps measured with the EPID for 38 beams from 19 clinical IMRT plans were assessed. Plans with various degrees of error in MLC modeling parameters [i.e., MLC transmission factor (TF) and dosimetric leaf gap (DLG)] and, for comparison, plans with an MLC positional error were created. For a total of 152 error plans for each type of error, we calculated fluence difference maps for each beam by subtracting the calculated maps from the measured maps. A total of 837 radiomic features were extracted from each fluence difference map, and we determined the number of features used for the training dataset in the machine learning models by using random forest regression. Machine learning models using five typical algorithms [decision tree, k-nearest neighbor (kNN), support vector machine (SVM), logistic regression, and random forest] were developed for binary classification between the error-free plan and the plan with the corresponding error for each type of error. We used part of the total dataset to perform fourfold cross-validation to tune the models, and we used the remaining test dataset to evaluate the performance of the developed models. A gamma analysis was also performed between the measured and calculated fluence maps with the criteria of 3%/2 mm and 2%/2 mm for all of the types of error. The radiomic features and their optimal number were similar for the TF and DLG error detection models but differed from those for the MLC positional error. The highest sensitivity was 0.913 for the TF error with SVM and logistic regression, 0.978 for the DLG error with kNN and SVM, and 1.000 for the MLC positional error with kNN, SVM, and random forest.
The highest specificity was 1.000 for the TF error with a decision tree, SVM, and logistic regression, 1.000 for the DLG error with a decision tree, logistic regression, and random forest, and 0.909 for the MLC positional error with a decision tree and logistic regression. The gamma analysis showed the poorest performance, with sensitivities of 0.737 for the TF error and the DLG error and 0.882 for the MLC positional error at 3%/2 mm. The addition of another type of error to fluence maps significantly reduced the sensitivity for the TF and the DLG error, whereas no effect was observed for the MLC positional error detection. Compared to the conventional gamma analysis, the radiomics-based machine learning models showed higher sensitivity and specificity in detecting a single type of MLC modeling error and the MLC positional error. Although the developed models need further improvement for detecting multiple types of error, radiomics-based IMRT QA was shown to be a promising approach for detecting MLC modeling errors.

  • Research Article
  • 10.59413/ajocs/v6.i.1.4
Comparative Analysis of Machine Learning Algorithms for Enhancing Social Media Marketing and Decision-Making in Kenyan SMEs.
  • Jan 7, 2025
  • African Journal of Commercial Studies
  • Christopher Fred

Small and medium-sized enterprises (SMEs) in Kenya are crucial to the nation's economic advancement, yet they sometimes have difficulties competing in a rapidly digitalizing market due to limited resources and inadequate marketing strategies. Social media platforms such as Facebook, Instagram, and X (formerly Twitter) are essential tools for cost-effective marketing; nevertheless, many SMEs fail to leverage their potential due to a lack of data-driven strategy. Machine Learning (ML) algorithms offer a transformative method for SMEs to examine social media data, enhance campaigns, and refine decision-making. This research conducts a comparative analysis of five prominent machine learning algorithms: Logistic Regression, Decision Trees, Random Forests, Support Vector Machines (SVM), and Neural Networks, with the objective of improving social media marketing campaigns and decision-making for SMEs in Kenya. The researchers assess the effectiveness of these algorithms in critical marketing functions, including consumer segmentation, sentiment analysis, and campaign optimization. A dataset comprising engagement indicators, customer profiles, and campaign performance metrics from Kenyan SMEs was used to evaluate the algorithms' accuracy, precision, recall, F1 score, and computational efficiency. The findings demonstrate that Random Forests strike a balance between accuracy and computational efficiency, making them a feasible choice for small and medium-sized enterprises with constrained resources. Logistic Regression is cost-effective and suitable for basic jobs, while Neural Networks are proficient at handling unstructured data but require significant computer resources. Decision trees, despite being understandable and user-friendly, are prone to overfitting, whereas support vector machines, although effective for small datasets, require significant computational resources for large-scale applications. 
The research indicates that significant challenges, such as insufficient technical expertise, elevated computing expenses, and data privacy issues, hinder the use of machine learning by small and medium-sized enterprises in Kenya. It also highlights the potential of cloud-based machine learning platforms, support from the government and private sectors for SME training, and partnerships to improve the accessibility of machine learning solutions. This research contributes to the growing body of knowledge on the application of ML in marketing and provides actionable recommendations for Kenyan SMEs to harness ML technologies for improved social media marketing and informed decision-making.

  • Conference Article
  • Cite Count: 1
  • 10.1109/icabme53305.2021.9604816
Artificial Intelligence Framework for COVID19 Patients Monitoring
  • Oct 7, 2021
  • Sandy Rihana + 1 more

The current global spread of COVID-19, a highly contagious disease, has challenged healthcare systems and placed immense burdens on medical staff globally. Roughly 5% to 10% of hospitalized patients will require ICU admission. Predicting ICU admission can help to better manage both the patient and the healthcare system. This study aims to develop a model that can predict whether a COVID-19 patient who has already been admitted to the hospital will enter the ICU. This could be accomplished by monitoring the patient's vital signs and blood tests and by consulting the patient's demographic records during the hospital stay. Multiple models, including Artificial Neural Networks, Logistic Regression, Decision Tree, Random Forest, Gaussian Naïve Bayes, Gradient Boosting, and Support Vector Machines, were designed and implemented using MATLAB and Python. The decision tree-based algorithms (Random Forest, Decision Tree, and Gradient Boosting) outperformed the others. Random Forest (Accuracy: 99.12%, Cross-Validation Accuracy: 86.34%), Decision Tree (Accuracy: 99.12%, Cross-Validation Accuracy: 79.48%), and Gradient Boosting (Accuracy: 93.77%, Cross-Validation Accuracy: 86.96%) had the highest accuracy scores compared with other models such as the Support Vector Machines (Accuracy: 87.74%, Cross-Validation Accuracy: 72.42%). In future work, the aim will be to predict whether a patient will be admitted to the ICU based on monitoring over multiple time windows. Higher accuracy is expected, since the model will analyze the vital signs and laboratory data at multiple stages and times, allowing the need for ICU admission to be anticipated well ahead of time.
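The abstract reports two figures per model, an accuracy and a cross-validation accuracy. The sketch below shows one way such a gap can arise, assuming (the abstract does not say) that the first figure is measured on data the model has already seen. The 1-nearest-neighbour classifier and the toy one-dimensional data are stand-ins invented for illustration and are not the models or data used in the study.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

def predict_1nn(train_x, train_y, x):
    """Return the label of the closest training point (memorizes the data)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def accuracy(train_x, train_y, test_x, test_y):
    """Share of test points whose predicted label matches the true label."""
    hits = sum(predict_1nn(train_x, train_y, x) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_x)

# Toy data: label is mostly 1 for x > 0.5, with two noisy points near 0.5.
xs = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.45, 0.55, 0.15, 0.85]
ys = [0, 1, 0, 1, 0, 1, 1, 0, 0, 1]

# Accuracy on the same data the model was fitted to.
train_acc = accuracy(xs, ys, xs, ys)

# Five-fold cross-validation accuracy: each fold is held out once.
fold_accs = []
for train_idx, test_idx in k_fold_indices(len(xs), 5):
    fold_accs.append(accuracy(
        [xs[i] for i in train_idx], [ys[i] for i in train_idx],
        [xs[i] for i in test_idx], [ys[i] for i in test_idx],
    ))
cv_acc = sum(fold_accs) / len(fold_accs)
```

Here the memorizing classifier scores perfectly on data it has seen but misclassifies the two noisy points when they are held out, so the cross-validated figure is the more conservative estimate of performance on new patients.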

  • Research Article
  • 10.32350/icr.32.05
Sentiment Analysis of Roman Urdu Text Using Machine Learning Techniques
  • Dec 5, 2023
  • Innovative Computing Review
  • Mubasher Malik + 1 more

Social media has attained popularity during the last few decades due to the rapid growth of online businesses and social interaction. People can interact with one another and communicate their sentiments by expressing their ideas and points of view on social media. Businesses involved in manufacturing, sales, and marketing increasingly focus on social media to get feedback on their goods and services from people worldwide. Businesses must process and analyze this feedback in the form of sentiments to gain business insights. Every day, millions of Urdu and Roman Urdu sentences are posted on social media platforms. Much of this massive amount of data is lost because thoughts and opinions expressed in low-resource languages, such as Urdu and Roman Urdu, are ignored in favor of resource-rich languages, such as English. The current study focused on sentiment analysis of Roman Urdu text. Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) text representation techniques were used to conduct the study. Support Vector Machine (SVM), Linear Support Vector Classifier (LSVC), Logistic Regression (LR), and Random Forest (RF) classifiers were deployed. The experiments showed that SVM achieved 94.74% accuracy, while RF achieved 93.13%, using the BoW representation.

  • Research Article
  • Cite Count: 8
  • 10.3390/app13074570
Classification of Virtual Harassment on Social Networks Using Ensemble Learning Techniques
  • Apr 4, 2023
  • Applied Sciences
  • Nureni Ayofe Azeez + 1 more

Background: Internet social media platforms have become quite popular, enabling a wide range of online users to stay in touch with their friends and relatives wherever they are at any time. This has led to a significant increase in virtual crime from the inception of these platforms to the present day. Users are harassed online when confidential information about them is stolen, or when another user posts insulting or offensive comments about them. This has posed a significant threat to online social media users, both mentally and psychologically. Methods: This research compares traditional classifiers and ensemble learning in classifying virtual harassment in online social media networks by applying both kinds of model to four different datasets: seven machine learning algorithms (Naive Bayes (NB), Decision Tree (DT), K-Nearest Neighbor (KNN), Logistic Regression (LR), Neural Network (NN), Quadratic Discriminant Analysis (QDA), and Support Vector Machine (SVM)) and four ensemble learning models (Ada Boosting, Gradient Boosting, Random Forest, and Max Voting). Finally, we compared our results using twelve evaluation metrics to show the validity of our algorithms: Accuracy, Precision, Recall, F1-measure, Specificity, Matthew’s Correlation Coefficient (MCC), Cohen’s Kappa Coefficient (KAPPA), Area Under Curve (AUC), False Discovery Rate (FDR), False Negative Rate (FNR), False Positive Rate (FPR), and Negative Predictive Value (NPV). Results: For dataset 1, Logistic Regression had the highest accuracy among the machine learning algorithms, 0.6923, while the Max Voting ensemble had the highest ensemble accuracy, 0.7047. For dataset 2, K-Nearest Neighbor, Support Vector Machine, and Logistic Regression all had the same highest accuracy of 0.8769 among the machine learning algorithms, while the Random Forest and Gradient Boosting ensembles both had the highest accuracy of 0.8779.
For dataset 3, the Support Vector Machine had the highest accuracy of 0.9243 among the machine learning algorithms, while the Random Forest ensemble had the highest accuracy of 0.9258. For dataset 4, the Support Vector Machine and Logistic Regression both had 0.8383, while the Max Voting ensemble obtained an accuracy of 0.8280. A bar chart was used to represent our results, showing the minimum, maximum, and quartile ranges. Conclusions: This technique helped in comparing the selected machine learning algorithms and the ensembles for detecting and exposing various forms of cyber harassment in cyberspace. Finally, the best and weakest algorithms were revealed.
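The Max Voting ensemble mentioned above is commonly understood as majority (hard) voting over the base classifiers' predicted labels. The sketch below contrasts it with soft voting, which averages predicted class probabilities instead. The three probability vectors are hypothetical outputs for a single comment, not results from the study, and the paper's exact voting rule is an assumption here.

```python
from collections import Counter

def hard_vote(labels):
    """Majority (max) voting: the most frequent predicted label wins."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(probas):
    """Soft voting: average class probabilities, then take the argmax."""
    n_classes = len(probas[0])
    avg = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c])

# Hypothetical outputs of three classifiers: [P(class 0), P(class 1)].
probas = [[0.55, 0.45], [0.60, 0.40], [0.10, 0.90]]

# Each classifier's own label is the class with the highest probability.
labels = [max(range(2), key=lambda c: p[c]) for p in probas]

hard_result = hard_vote(labels)   # two of three classifiers pick class 0
soft_result = soft_vote(probas)   # one confident vote shifts the average
```

The two rules can disagree, as they do here: hard voting counts each classifier equally, while soft voting lets a confident classifier outweigh two lukewarm ones.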

  • Research Article
  • Cite Count: 37
  • 10.3390/math8050851
Comparison of Supervised Classification Models on Textual Data
  • May 24, 2020
  • Mathematics
  • Bi-Min Hsu

Text classification is an essential aspect in many applications, such as spam detection and sentiment analysis. With the growing number of textual documents and datasets generated through social media and news articles, an increasing number of machine learning methods are required for accurate textual classification. For this paper, a comprehensive evaluation of the performance of multiple supervised learning models, such as logistic regression (LR), decision trees (DT), support vector machine (SVM), AdaBoost (AB), random forest (RF), multinomial naive Bayes (NB), multilayer perceptrons (MLP), and gradient boosting (GB), was conducted to assess the efficiency and robustness, as well as limitations, of these models on the classification of textual data. SVM, LR, and MLP had better performance in general, with SVM being the best, while DT and AB had much lower accuracies than the other tested models. Further exploration of different SVM kernels was performed, demonstrating the advantage of linear kernels over polynomial, sigmoid, and radial basis function kernels for text classification. The effects of removing stop words on model performance were also investigated; DT performed better with stop words removed, while all other models were relatively unaffected by the presence or absence of stop words.

  • Research Article
  • Cite Count: 2
  • 10.4314/njtd.v19i2.10
Identification of pharming in communication networks using ensemble learning
  • Aug 1, 2022
  • Nigerian Journal of Technological Development
  • N A Azeez + 2 more

Pharming scams are carried out by exploiting the DNS as the main weapon, while phishing attacks employ spoofed websites that appear legitimate to internet users. Phishing makes use of baits such as fake links, but pharming exploits the DNS server to redirect internet users to a fake, simulated website. Given the several challenges that pharming poses, resulting in vulnerable websites, personal emails, and social media accounts, the use of and reliance on the internet call for caution. Against this backdrop, this work aims at enhancing pharming detection strategies by adopting machine learning classification algorithms. To further obtain the best classification results, an ensemble learning approach was adopted. The algorithms used include K-Nearest Neighbors (KNN), Decision Tree, Random Forest, Gaussian Naive Bayes, Logistic Regression, Support Vector Machine, Adaptive Boosting, Gradient Boosting, and Extra Trees Classifier. During the testing process, the classifiers were tested against four popular metrics (accuracy, recall, precision, and F1 score) as well as log loss. The results demonstrate the performance of all algorithms used, as well as their relationships. The ensemble model that included Logistic Regression, K-Nearest Neighbors, Decision Tree, Support Vector Machine, Gradient Boosting Classifier, AdaBoost Classifier, Extra Trees Classifier, and Random Forest produced the best results after evaluation on the two datasets. Among the classifiers, Random Forest performed better, with mean accuracies of 0.932 and 0.939 on the two datasets, compared with 0.476 and 0.519 for Naive Bayes.
