Modeling forest restoration potential in the Scottish Highlands using multiple machine learning approaches
Abstract Natural range contraction and millennia of anthropogenic disturbance have led to a steady decline in stands of Caledonian Pine Forest (CPF) across the Scottish Highlands. Much of the land is now dominated by short-stature shrubs, with fragmented areas of native forest and commercial plantation. Several surmountable barriers exist to large-scale reforestation of the CPF, including tensions with existing economies, land ownership patterns, and the uncertainties posed by a changing climate. To address these challenges and support decision-making, we developed a data-driven approach to identify optimal sites for native forest restoration across the CPF. We trained, validated, and deployed five machine learning classification models – multilayer perceptron, naïve Bayes, random forest, support vector machine and XGBoost – to predict which sites across the CPF ecoregion were most suitable for one of three broad native forest community types: Scots pine, oak woodland and birch woodland. In the least restrictive reforestation scenario, we identified a total of 844,339 hectares of potential reforestation area, while the most restrictive reforestation scenario identified 210,703 hectares. Birch, Scots pine and oak ranked from most to least predicted sites. Among the models, XGBoost demonstrated the highest predictive power as measured by area under the receiver operating characteristic curve (AUC = 0.974, Accuracy = 0.918), while naïve Bayes performed the least effectively (AUC = 0.794, Accuracy = 0.655). Our findings provide a spatially explicit foundation for prioritising reforestation efforts, enabling stakeholders to maximise ecological gains while navigating competing land use pressures.
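As a rough illustration of the two metrics reported above, the sketch below computes AUC and accuracy for a binary classifier in plain Python. The labels and scores are made-up toy values, not data from the study, and the function names `roc_auc` and `accuracy` are hypothetical. The study predicts three forest community types, so applying a binary metric like this would need some further assumption, for example a one-vs-rest treatment per type.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive scores above a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Toy data (assumed for illustration only): 1 = suitable site, 0 = unsuitable.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

auc_value = roc_auc(labels, scores)
acc_value = accuracy(labels, scores)
print(f"AUC = {auc_value:.3f}, Accuracy = {acc_value:.3f}")
```

AUC is threshold-free, whereas accuracy depends on the chosen cut-off, which is one reason the two numbers can rank models differently.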
- Research Article
18
- 10.1002/cpe.7190
- Jul 26, 2022
- Concurrency and Computation: Practice and Experience
Summary Spatiotemporal solar radiation forecasting is extremely challenging due to its dependence on meteorological and environmental factors. Chaotic time‐varying behaviour and non‐linearity make the forecasting model more complex. To address this issue, the paper provides a comprehensive investigation of a deep learning framework for predicting the two components of solar irradiation, that is, Diffuse Horizontal Irradiance (DHI) and Direct Normal Irradiance (DNI). Following exploratory data analysis, three of the most prominent recent deep learning (DL) architectures were developed and compared with classical machine learning (ML) models in terms of statistical performance accuracy. In our study, the DL architectures include convolutional neural network (CNN) and recurrent neural network (RNN), whereas the classical ML models include Random Forest (RF), Support Vector Regression (SVR), Multilayer Perceptron (MLP), Extreme Gradient Boosting (XGB), and K‐Nearest Neighbor (KNN). Additionally, three optimization techniques, Grid Search (GS), Random Search (RS), and Bayesian Optimization (BO), were incorporated to tune the hyperparameters of the classical ML models and obtain the best results. A rigorous comparative analysis found that the CNN model outperformed all the classical ML models and the other DL models, with the lowest mean squared error, the highest R‐Squared value and the least computational time.
- Research Article
25
- 10.1007/s13755-020-00104-w
- Mar 9, 2020
- Health Information Science and Systems
Given the demand for developing efficient Machine Learning (ML) classification models for healthcare data, and the potential of Bio-Inspired Optimization (BIO) algorithms to tackle the problem of high-dimensional data, we investigate a range of ML classification models trained with the optimal subset of features of a PD data set for efficient PD classification. We used two BIO algorithms, Genetic Algorithm (GA) and Binary Particle Swarm Optimization (BPSO), to determine the optimal subset of features of the PD data set. The data set chosen for investigation comprises 756 observations (rows or records) taken over 755 attributes (columns or dimensions or features) from 252 PD patients. We employed the MaxAbsolute feature scaling method to normalize the data and a hold-out validation method to avoid biased results. Accordingly, the data are split into training and testing sets in a ratio of 70% to 30%. Subsequently, we applied the GA and BPSO algorithms separately to 11 ML classifiers (Logistic Regression (LR), linear Support Vector Machine (lSVM), radial basis function Support Vector Machine (rSVM), Gaussian Naïve Bayes (GNB), Gaussian Process Classifier (GPC), k-Nearest Neighbor (kNN), Decision Tree (DT), Random Forest (RF), Multilayer Perceptron (MLP), Ada Boost (AB) and Quadratic Discriminant Analysis (QDA)) to determine the optimal subset of features (reduction of dimensionality) contributing to the highest classification accuracy. 
Among all the bio-inspired ML classifiers employed: GA-inspired MLP produced the maximum dimensionality reduction of 52.32% by selecting only 359 features, with a classification accuracy of 85.1%; GA-inspired AB delivered the maximum classification accuracy of 90.7%, with a dimensionality reduction of 41.43% by selecting only 441 features; BPSO-inspired GNB produced the maximum dimensionality reduction of 47.14% by selecting 396 features, with a classification accuracy of 79.3%; and BPSO-inspired MLP delivered the maximum classification accuracy of 89%, with a dimensionality reduction of 46.48% by selecting only 403 features.
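The dimensionality-reduction percentages in this abstract can be reproduced with simple arithmetic. The sketch below is only an illustration: the baseline of 753 candidate features is an assumption inferred from the reported numbers (the abstract itself states 755 attributes, which presumably include columns that are not candidate features), and the function name `reduction_pct` is hypothetical.

```python
# Assumption: 753 candidate features, inferred from the reported percentages;
# the abstract itself lists 755 attributes in total.
ASSUMED_TOTAL_FEATURES = 753


def reduction_pct(selected, total=ASSUMED_TOTAL_FEATURES):
    """Percentage of features discarded when `selected` of `total` are kept."""
    if not 0 <= selected <= total:
        raise ValueError("selected must lie between 0 and total")
    return 100.0 * (1.0 - selected / total)


# Selected-feature counts as reported in the abstract.
reported = {
    "GA-inspired MLP": 359,
    "GA-inspired AB": 441,
    "BPSO-inspired MLP": 403,
}

for name, kept in reported.items():
    print(f"{name}: kept {kept}, reduction = {reduction_pct(kept):.2f}%")
```

Under the same assumed baseline, 396 selected features would correspond to a reduction of about 47.4%, slightly different from the 47.14% reported for BPSO-inspired GNB.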
- Research Article
5
- 10.1007/s10278-023-00957-z
- Jan 12, 2024
- Journal of imaging informatics in medicine
This paper aims to compare the performance of the classical machine learning (CML) model and the deep learning (DL) model, and to assess the effectiveness of utilizing fusion radiomics from both CML and DL in distinguishing encephalitis from glioma in atypical cases. We analysed the axial FLAIR images of preoperative MRI in 116 patients pathologically confirmed as gliomas and clinically diagnosed with encephalitis. The 3 CML models (logistic regression (LR), support vector machine (SVM) and multi-layer perceptron (MLP)), 3 DL models (DenseNet 121, ResNet 50 and ResNet 18) and a deep learning radiomic (DLR) model were established, respectively. The area under the receiver operating curve (AUC) and sensitivity, specificity, accuracy, negative predictive value (NPV) and positive predictive value (PPV) were calculated for the training and validation sets. In addition, a deep learning radiomic nomogram (DLRN) and a web calculator were designed as a tool to aid clinical decision-making. The best DL model (ResNet50) consistently outperformed the best CML model (LR). The DLR model had the best predictive performance, with AUC, sensitivity, specificity, accuracy, NPV and PPV of 0.879, 0.929, 0.800, 0.875, 0.867 and 0.889 in the validation sets, respectively. Calibration curve of DLR model shows good agreement between prediction and observation, and the decision curve analysis (DCA) indicated that the DLR model had higher overall net benefit than the other two models (ResNet50 and LR). Meanwhile, the DLRN and web calculator can provide dynamic assessments. Machine learning (ML) models have the potential to non-invasively differentiate between encephalitis and glioma in atypical cases. Furthermore, combining DL and CML techniques could enhance the performance of the ML models.
- Research Article
2
- 10.13053/cys-24-2-3367
- Jun 30, 2020
- Computación y Sistemas
Machine learning (ML) techniques have been used to classify cancer types to support physicians in the diagnosis of a disease. Usually, these models are based on structured data obtained from clinical databases. However, valuable information contained in the clinical notes of patient records is not frequently used. In this paper, an approach to obtain information from clinical notes, based on Natural Language Processing techniques and the Paragraph Vectors algorithm, is presented. Moreover, Machine Learning models for classification of liver, breast and lung cancer patients are used. Also, a comparison and evaluation of the chosen ML models with varying parameters was conducted to obtain the best one. The ML algorithms chosen are Support Vector Machines (SVM) and Multi-Layer Perceptron (MLP). The results obtained are promising and show that the best model for classification is the MLP model, with a precision of 0.89 and an F1-score of 0.87, although the difference in precision between models is minimal (0.02).
- Research Article
5
- 10.1109/thms.2021.3059716
- Jun 1, 2021
- IEEE Transactions on Human-Machine Systems
In exercise gaming (exergaming), reward systems are typically based on rules/templates from joint movement patterns. These rules or templates need broad ranges in definitions of correct movement patterns to accommodate varying body shapes and sizes. This can lead to inaccurate rewards and, thus, inefficient exercise, which can be detrimental to progress. If exergames are to be used in serious settings like rehabilitation, accurate rewards for correctly performed movements are crucial. This article aims to investigate the level of accuracy machine learning/deep learning models can achieve in classification of correct repetitions naturally elicited from a weight-shifting exergame. Twelve healthy elderly participants (10F, age 70.4, SD 11.4) are recruited. Movements are captured using a marker-based 3-D motion-capture system. Random forest (RF), support vector machine, k-nearest neighbors, and multilayer perceptron (MLP) are the employed models, trained and tested on whole body movement patterns and on subsets of joints. MLP and RF reached the highest recall and F1-score, respectively, when using combined data from joint subsets. MLP recall ranges from 91% to 94%, and RF F1-score from 79% to 80%. MLP and RF also reached the highest recall and F1-score in each joint subset, respectively. Here, MLP ranged from 93% to 97% recall, while RF ranged from 73% to 80% F1-score. Recall results show that >9 out of 10 repetitions are classified correctly, indicating that MLP/RF can be used to identify correctly performed repetitions of a weight-shifting exercise when using full-body data and when using joint subset data.
- Research Article
10
- 10.3390/rs16091582
- Apr 29, 2024
- Remote Sensing
The proliferation of invasive plant species poses a significant ecological threat, necessitating effective mapping strategies for control and conservation efforts. Existing studies employing unmanned aerial vehicles (UAVs) and multispectral (MS) sensors in complex natural environments have predominantly relied on classical machine learning (ML) models for mapping plant species in natural environments. However, a critical gap exists in the literature regarding the use of deep learning (DL) techniques that integrate MS data and vegetation indices (VIs) with different feature extraction techniques to map invasive species in complex natural environments. This research addresses this gap by focusing on mapping the distribution of the Broad-leaved pepper (BLP) along the coastal strip in the Sunshine Coast region of Southern Queensland in Australia. The methodology employs a dual approach, utilising classical ML models including Random Forest (RF), eXtreme Gradient Boosting (XGBoost), and Support Vector Machine (SVM) in conjunction with the U-Net DL model. This comparative analysis allows for an in-depth evaluation of the performance and effectiveness of both classical ML and advanced DL techniques in mapping the distribution of BLP along the coastal strip. Results indicate that the DL U-Net model outperforms classical ML models, achieving a precision of 83%, recall of 81%, and F1–score of 82% for BLP classification during training and validation. The DL U-Net model attains a precision of 86%, recall of 76%, and F1–score of 81% for BLP classification, along with an Intersection over Union (IoU) of 68% on the separate test dataset not used for training. These findings contribute valuable insights to environmental conservation efforts, emphasising the significance of integrating MS data with DL techniques for the accurate mapping of invasive plant species.
- Research Article
4
- 10.1186/s12905-025-03669-4
- Mar 28, 2025
- BMC Women's Health
Background: The aim of this study was to develop a machine learning (ML) model for classifying osteoporosis in Korean women based on a large-scale population cohort study. This study also aimed to assess ML model performance compared with traditional osteoporosis screening tools. Furthermore, this study aimed to examine the factors influencing the risk of osteoporosis through variable importance. Methods: Data were collected from 4199 women aged 40–69 years in the baseline survey of the Ansan and Ansung cohort of the Korean Genome and Epidemiology Study. Osteoporosis was set as the dependent variable to develop ML classification models. Independent variables included 122 factors related to osteoporosis risk, such as socio-demographic characteristics, anthropometric parameters, lifestyle factors, reproductive factors, nutrient intakes, diet quality indices, medical history, medication history, family history, biochemical parameters, and genetic factors. The six classification models were developed using ML techniques, including decision tree, random forest, multilayer perceptron, support vector machine, light gradient boosting machine, and extreme gradient boosting (XGBoost). The six ML classification models were compared with two traditional osteoporosis screening tools, including the osteoporosis risk assessment instrument (ORAI) and the osteoporosis self-assessment tool (OST). The ML model performances were evaluated and compared using the confusion matrix and area under the curve (AUC) metrics. Variable importance was assessed using the XGBoost technique to investigate osteoporosis risk factors. Results: The XGBoost model showed the highest performance out of the six ML classification models, with an accuracy of 0.705, precision of 0.664, recall of 0.830, and F1 score of 0.738. Moreover, the XGBoost model showed a higher performance on AUC than ORAI and OST. 
Variable importance scores were identified for 69 out of the 122 variables associated with osteoporosis risk factors. Age at menopause ranked first in variable importance. Variables of arthritis, physical activities, hypertension, education level, income level; alcohol intake, potassium intake, homeostatic model assessment for insulin resistance; energy intake, vitamin C intake, gout; and dietary inflammatory index ranked in the top 20 out of the 69 variables, using the XGBoost technique. Conclusions: This study found that an XGBoost model can be utilized to classify osteoporosis in Korean women. Age at menopause is a significant factor in osteoporosis risk, followed by arthritis, physical activities, hypertension, and education level.
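The F1 score reported for the XGBoost model in this abstract follows from its reported precision and recall, since F1 is their harmonic mean. The short sketch below checks that relationship; the function name `f1_score` is a hypothetical helper written for this illustration, not code from the study.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Precision and recall as reported for the XGBoost model in the abstract.
xgb_f1 = f1_score(0.664, 0.830)
print(f"F1 = {xgb_f1:.3f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a model with high recall but moderate precision, as here, ends up with an F1 between the two and closer to the precision.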
- Research Article
- 10.1038/s41598-025-16479-3
- Aug 25, 2025
- Scientific reports
Although antiretroviral therapy has prolonged the lifespan of people living with HIV, significant variations still exist in survival rates and risk factors among these people. This study compares the performance of the Cox proportional hazard models with four machine learning models in predicting the survival of people living with HIV, analyzing the survival factors among them, thereby assisting medical decision-making. We collected data on 676 people living with HIV from the Chinese Center for Disease Control and Prevention. Significant variables (p < 0.05) were identified using Cox univariate analysis. Using a random number method, the data were split into a training set (473 cases) and a test set (203 cases) in a 7:3 ratio. We employed the Cox proportional hazard model and four classification machine learning models, including eXtreme Gradient Boosting, Random Forest, Support Vector Machine, and Multilayer Perceptron, to develop survival prediction models for people living with HIV. The predictive performance of these models was evaluated based on accuracy, precision, recall, F1-score, area under the receiver operating characteristic curve (AUC), and calibration curves, and the best model was selected based on these metrics. The average age of diagnosis among the sample participants was 56.63 years (SD = 17.53). Considering the performance of both the training and testing cohorts, the Random Forest classifier emerged as the model with the best predictive performance, with an AUC of 0.912, an Accuracy of 0.862, a Precision of 0.794, a Recall of 0.562, and an F1 score of 0.659. Random Forest was followed by the Support Vector Machine, the eXtreme Gradient Boosting, Multilayer Perceptron, and the Cox proportional hazard model performed similarly. The predictive performance of machine learning models surpasses traditional Cox proportional hazard models. 
In China, the Random Forest model can be considered for analyzing and predicting the survival rates of people living with HIV.
- Research Article
61
- 10.1108/jicv-07-2021-0008
- Dec 27, 2021
- Journal of Intelligent and Connected Vehicles
Purpose An individual’s driving style significantly affects overall traffic safety. However, driving style is difficult to identify due to temporal and spatial differences and scene heterogeneity of driving behavior data. As such, the study of real-time driving-style identification methods is of great significance for formulating personalized driving strategies, improving traffic safety and reducing fuel consumption. This study aims to establish a driving style recognition framework based on longitudinal driving operation conditions (DOCs) using a machine learning model and natural driving data collected by a vehicle equipped with an advanced driving assistance system (ADAS). Design/methodology/approach Specifically, a driving style recognition framework based on longitudinal DOCs was established. To train the model, a real-world driving experiment was conducted. First, the driving styles of 44 drivers were preliminarily identified through natural driving data and video data; drivers were categorized through a subjective evaluation as conservative, moderate or aggressive. Second, based on the ADAS driving data, a criterion for extracting longitudinal DOCs was developed. Third, taking the ADAS data from 47 km of the two test expressways as the research object, six DOCs were calibrated and the characteristic data sets of the different DOCs were extracted and constructed. Finally, four machine learning classification (MLC) models were used to classify and predict driving style based on the natural driving data. Findings The results showed that six longitudinal DOCs were calibrated according to the proposed calibration criterion. Cautious drivers undertook the largest proportion of the free cruise condition (FCC), while aggressive drivers primarily undertook the FCC, following steady condition and relative approximation condition. Compared with cautious and moderate drivers, aggressive drivers adopted a smaller time headway (THW) and distance headway (DHW). 
THW, time-to-collision (TTC) and DHW showed highly significant differences in driving style identification, while longitudinal acceleration (LA) showed no significant difference in driving style identification. Speed and TTC showed no significant difference between moderate and aggressive drivers. In consideration of the cross-validation results and model prediction results, the overall hierarchical prediction performance ranking of the four studied machine learning models under the current sample data set was extreme gradient boosting > multi-layer perceptron > logistic regression > support vector machine. Originality/value The contribution of this research is to propose a criterion and solution for using longitudinal driving behavior data to label longitudinal DOCs and rapidly identify driving styles based on those DOCs and MLC models. This study provides a reference for real-time online driving style identification in vehicles equipped with onboard data acquisition equipment, such as ADAS.
- Research Article
52
- 10.3390/biomedinformatics2030022
- Jun 26, 2022
- BioMedInformatics
Breast cancer is a prevalent disease that affects mostly women, and early diagnosis will expedite the treatment of this ailment. Recently, machine learning (ML) techniques have been employed in biomedical and informatics to help fight breast cancer. Extracting information from data to support the clinical diagnosis of breast cancer is a tedious and time-consuming task. The use of machine learning and feature extraction techniques has significantly changed the whole process of a breast cancer diagnosis. This research work proposed a machine learning model for the classification of breast cancer. To achieve this, a support vector machine (SVM) was employed for the classification, and linear discriminant analysis (LDA) was employed for feature extraction. We measured our model’s feature extraction performance in principal component analysis (PCA) and random forest for classification. A comparative analysis of the proposed model was performed to show the effectiveness of the feature extraction, and we computed missing values based on the classifier’s accuracy, precision, and recall. The original Wisconsin Breast Cancer dataset (WBCD) and Wisconsin Prognostic Breast Cancer dataset (WPBC) were used. We evaluated performance in two phases: In phase 1, rows containing missing values were computed using the mean, and in phase 2, rows containing missing values were computed using the median. LDA–SVM when median was used to compute missing values has better results, with accuracy of 99.2%, recall of 98.0% and precision of 98.0% on the WBCD dataset and an accuracy of 79.5%, recall of 76.0% and precision of 59.0% on the WPBC dataset. The SVM classifier had a better performance in handling classification problems when LDA was applied and the median was used as a method for computing missing values.
- Research Article
- 10.26634/jpr.9.2.19086
- Jan 1, 2022
- i-manager’s Journal on Pattern Recognition
Image classification is a complex process and an important direction in the field of image processing. Image classification methods require learning and training stages. Using machine learning classification models in image classification gives better results. Decision Tree, Random Forest, Gradient Boosting, Bagging Classifier, Multi-Layer Perceptron (MLP) Classifier, and Support Vector Machine (SVM) are different machine-learning classification models. The goal of this paper is to analyze the machine learning classification models. These models classify 12 kinds of plant seedlings, of which 3 are crop seedlings and 9 are weed seedlings. This paper suggests that, when using a V2 Plant Seedlings dataset, the accuracy of SVM is 0.71 and the accuracy of other models is less compared to SVM. The experimental results in this paper show that the machine learning model SVM has a better solution effect and higher recognition accuracy. This paper focuses on model building, training, and assessing the quality of the model by generating a confusion matrix and a classification report.
- Research Article
46
- 10.3390/rs12101620
- May 19, 2020
- Remote Sensing
Rice is an important agricultural crop in the Southwest Hilly Area, China, but there has been a lack of efficient and accurate monitoring methods in the region. Recently, convolutional neural networks (CNNs) have obtained considerable achievements in the remote sensing community. However, it has not been widely used in mapping a rice paddy, and most studies lack the comparison of classification effectiveness and efficiency between CNNs and other classic machine learning models and their transferability. This study aims to develop various machine learning classification models with remote sensing data for comparing the local accuracy of classifiers and evaluating the transferability of pretrained classifiers. Therefore, two types of experiments were designed: local classification experiments and model transferability experiments. These experiments were conducted using cloud-free Sentinel-2 multi-temporal data in Banan District and Zhongxian County, typical hilly areas of Southwestern China. A pure pixel extraction algorithm was designed based on land-use vector data and a Google Earth Online image. Four convolutional neural network (CNN) algorithms (one-dimensional (Conv-1D), two-dimensional (Conv-2D) and three-dimensional (Conv-3D_1 and Conv-3D_2) convolutional neural networks) were developed and compared with four widely used classifiers (random forest (RF), extreme gradient boosting (XGBoost), support vector machine (SVM) and multilayer perceptron (MLP)). Recall, precision, overall accuracy (OA) and F1 score were applied to evaluate classification accuracy. The results showed that Conv-2D performed best in local classification experiments with OA of 93.14% and F1 score of 0.8552 in Banan District, OA of 92.53% and F1 score of 0.8399 in Zhongxian County. CNN-based models except Conv-1D provided more desirable performance than non-CNN classifiers. 
Besides, among the non-CNN classifiers, XGBoost received the best result with OA of 89.73% and F1 score of 0.7742 in Banan District, SVM received the best result with OA of 88.57% and F1 score of 0.7538 in Zhongxian County. In model transferability experiments, almost all CNN classifiers had low transferability. RF and XGBoost models have achieved acceptable F1 scores for transfer (RF = 0.6673 and 0.6469, XGBoost = 0.7171 and 0.6709, respectively).
- Research Article
- 10.1080/15481603.2025.2612306
- Dec 31, 2026
- GIScience & Remote Sensing
Rapid and accurate flood mapping and monitoring are essential for effective disaster management, response, and recovery, as well as for advancing hydrological sciences. Despite the increasing availability of geospatial data and cloud platforms like Google Earth Engine (GEE), existing GEE-based flood mapping applications largely depend on traditional thresholding techniques requiring manual input or struggle with latency due to loose-coupling of complex models. This study introduces DeepSAR Flood Mapper, a novel, fully automated deep learning-based flood mapping application on the GEE cloud platform as an operational, publicly accessible tool, providing interactive and near-real-time capabilities globally. DeepSAR Flood Mapper utilizes a pre-trained Multilayer Perceptron (MLP) deep learning model, selected for its computational efficiency and ability to model highly nonlinear functions, facilitating seamless integration with GEE. The model integrates two crucial input datasets: Sentinel-1 Synthetic Aperture Radar imagery (VV and VH polarization) for all-weather surface water detection, and Height Above the Nearest Drainage topographic data to mitigate commission errors in elevated areas and enhance reliability. Trained on a combination of global benchmark datasets and historical flood maps, the MLP model is deployed using an Offline Training and Online Prediction coupling strategy, which eliminates data transfer bottlenecks and allows for seamless, on-demand prediction within GEE. The application features an intuitive user interface that allows users to define an Area of Interest and target date, requiring no specialized knowledge. Performance evaluation demonstrates that DeepSAR Flood Mapper significantly improves flood mapping accuracy compared to traditional approaches, including Otsu’s thresholding and classical machine learning models, Support Vector Machines and Random Forests. 
Its near-real-time capability supports timely and scalable flood monitoring across diverse geographic regions worldwide. The DeepSAR Flood Mapper application is publicly accessible online at: https://ee-tiandan-gee.projects.earthengine.app/view/deepsar-flood-mapper.
- Research Article
- 10.1038/s41598-025-19167-4
- Oct 8, 2025
- Scientific Reports
Claustrophobia, a phobia with a specific unreasonable and excessive fear of enclosed spaces, can have a considerable impact on an individual’s life. Electroencephalography (EEG) has been a tool with potential for studying neural processes in anxiety disorders including claustrophobia. In this work, a machine learning algorithm for differentiation between claustrophobic and healthy controls using EEG signals is presented. EEG data were collected from 22 participants under controlled conditions, and preprocessing included filtering, artifact removal, and feature extraction using relative Power Spectral Density (rPSD) across five frequency bands: delta, theta, alpha, beta, and gamma. Classical machine learning models such as Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), Decision Tree (DT), and Random Forest (RF) were applied to assess their suitability in this domain. In addition, deep learning models including Multi-Layer Perceptron (MLP) and a Convolutional Neural Network with Bidirectional Long Short-Term Memory (CNN-BiLSTM) integration were utilized for their capacity to capture complex temporal and spatial patterns in EEG data. Performance testing with five-fold cross-validation revealed that MLP and CNN-BiLSTM performed best in terms of accuracy in classification, with a 95.15% ± 0.77 when all bands of frequencies were combined together. An analysis of brain regions revealed frontal and temporal regions to differentiate between claustrophobic and non-claustrophobic subjects, and beta and theta bands played a significant role in distinguishing between them. These observations unveil high potential for EEG-based machine learning algorithms in objective evaluation of claustrophobia, and propose opportunities for future development in its therapy and diagnosis.
- Research Article
55
- 10.3390/telecom3020019
- May 27, 2022
- Telecom
The use of Machine Learning (ML) and Sentiment Analysis (SA) on data from microblogging sites has become a popular method for stock market prediction. In this work, we developed a model for predicting stock movement utilizing SA on Twitter and StockTwits data. Stock movement and sentiment data were used to evaluate this approach and validate it on Microsoft stock. We gathered tweets from Twitter and StockTwits, as well as financial data from Finance Yahoo. SA was applied to tweets, and seven ML classification models were implemented: K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Logistic Regression (LR), Naïve Bayes (NB), Decision Tree (DT), Random Forest (RF) and Multilayer Perceptron (MLP). The main novelty of this work is that it integrates multiple SA and ML methods, emphasizing the retrieval of extra features from social media (i.e., public sentiment), for improving stock prediction accuracy. The best results were obtained when tweets were analyzed using Valence Aware Dictionary and sEntiment Reasoner (VADER) and SVM. The top F-score was 76.3%, while the top Area Under Curve (AUC) value was 67%.