An Informative Analysis of Applying Feature Reduction Methods to Supervised Machine Learning Algorithms
- Research Article
6
- 10.1186/s40798-024-00788-4
- Nov 14, 2024
- Sports Medicine - Open
Supervised machine learning (ML) offers an exciting suite of algorithms that could benefit research in sport science. In principle, supervised ML approaches were designed for pure prediction, as opposed to explanation, leading to a rise in powerful, but opaque, algorithms. Recently, two subdomains of ML have grown in popularity: explainable ML, which allows us to “peek into the black box,” and interpretable ML, which encourages using algorithms that are inherently interpretable. The increased transparency of these powerful ML algorithms may provide considerable support for the hypothetico-deductive framework, in which hypotheses are generated from prior beliefs and theory, and are assessed against data collected specifically to test that hypothesis. However, this paper shows why ML algorithms are fundamentally different from statistical methods, even when using explainable or interpretable approaches. Translating potential insights from supervised ML algorithms, while in many cases seemingly straightforward, can have unanticipated challenges. While supervised ML cannot be used to replace statistical methods, we propose ways in which the sport sciences community can take advantage of supervised ML in the hypothetico-deductive framework. In this manuscript we argue that supervised machine learning can and should augment our exploratory investigations in sport science, but that leveraging potential insights from supervised ML algorithms should be undertaken with caution. We justify our position through a careful examination of supervised machine learning, and provide a useful analogy to help elucidate our findings. Three case studies are provided to demonstrate how supervised machine learning can be integrated into exploratory analysis. Supervised machine learning should be integrated into the scientific workflow with requisite caution.
The approaches described in this paper provide ways to safely leverage the strengths of machine learning, such as the flexibility ML algorithms can provide for fitting complex patterns, while avoiding the pitfalls that may arise when trying to integrate findings from ML algorithms into domain knowledge. At best, those pitfalls mean wasted effort and money; at worst, misguided clinical recommendations.
Key Points:
Some supervised machine learning algorithms and statistical models are used to solve the same problem, y = f(x) + ε, but differ fundamentally in motivation and approach.
The hypothetico-deductive framework, in which hypotheses are generated from prior beliefs and theory and are assessed against data collected specifically to test that hypothesis, is one of the core frameworks comprising the scientific method. In the hypothetico-deductive framework, supervised machine learning can be used in an exploratory capacity. However, it cannot replace the use of statistical methods, even as explainable and interpretable machine learning methods become increasingly popular.
Improper use of supervised machine learning in the hypothetico-deductive framework is tantamount to p-value hacking in statistical methods.
- Research Article
26
- 10.1016/j.solener.2023.111918
- Aug 3, 2023
- Solar Energy
Feature extraction-reduction and machine learning for fault diagnosis in PV panels
- Conference Article
4
- 10.1109/conecct55679.2022.9865722
- Jul 8, 2022
The introduction of Electronic Health Records (EHRs) is rapidly transforming healthcare. An EHR contains a patient's private information and health history in digital form. Because of privacy concerns, EHR data cannot be shared with the Machine Learning (ML) research community, whose work could make the healthcare system smarter and improve the quality of healthcare services for patients. As a result, synthetic data is used as a substitute when real-world data (such as EHR data) is unavailable. Synthetic data can be shared without revealing any private patient information. This paper focuses on generating synthetic data from a real dataset. As a use case, we selected a real Chronic Kidney Disease (CKD) dataset and produced three datasets: real, synthetic, and a combination of real + synthetic. To test the accuracy of the synthetic data, we ran six supervised machine learning algorithms on these three datasets, with all features and with reduced features, to predict whether a patient had CKD. The supervised ML algorithms are assessed on the three datasets using the following performance metrics: Confusion Matrix, Accuracy, Recall, Precision, and F1-Score. According to the results, XGBoost performs best, with 100 percent accuracy on all three datasets with full features and 100 percent accuracy on the combined real and synthetic dataset with feature reduction.
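The performance metrics named in the abstract above all derive from the four cells of a binary confusion matrix. The following is a minimal sketch of that relationship, assuming a binary CKD / not-CKD labelling; the function name `binary_metrics` and the label lists are hypothetical illustrations, not data or code from the paper.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute confusion-matrix cells and the metrics derived from them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = has CKD, 0 = does not.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Reporting the four cells alongside accuracy is useful when a classifier scores near 100 percent, since it shows whether the few errors are false positives or false negatives.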
- Research Article
11
- 10.1016/j.trpro.2022.02.048
- Jan 1, 2022
- Transportation Research Procedia
Benchmarking machine learning algorithms by inferring transportation modes from unlabeled GPS data
- Research Article
11
- 10.1057/s41260-022-00302-z
- Jan 7, 2023
- Journal of Asset Management
This study provides a methodological approach that applies artificial intelligence (AI)-based supervised machine learning (ML) algorithms to the risk assessment of post-pandemic household cryptocurrency investments, and identifies the best-performing ML algorithm and the most important risk assessment determinants. The empirical findings from analyzing 13 determinants in a dataset of 1,000 entries collected from major online cryptocurrency communities suggest that the logistic regression (LR) algorithm outperforms the remaining six ML algorithms, as judged by performance metrics, a lift chart, and an ROC chart. Moreover, to make the ML algorithm results explainable and tackle the “black box” issue, the top five most important determinants are identified: the interaction between investment amount and investment duration, investment amount, perception of traditional investments, cryptocurrency literacy, and perception of cryptocurrency volatility. The present study contributes to the literature on risk assessment, especially of household cryptocurrency investments in the post-pandemic era, and to the body of knowledge on explainable supervised ML algorithms.
- Research Article
34
- 10.1080/02642069.2022.2054996
- Mar 25, 2022
- The Service Industries Journal
This study provides an applicable methodological procedure applying Artificial Intelligence (AI)-based supervised Machine Learning (ML) algorithms in detecting fake reviews of online review platforms and identifies the best ML algorithm as well as the most critical fake review determinants for a given restaurant review dataset. Our empirical findings from analyzing 16 determinants (review-related, reviewer-related, and linguistic attributes) measured from over 43,000 online restaurant reviews reveal that among the seven ML algorithms, the random forest algorithm outperforms the other algorithms and, among the 16 review attributes, time distance is found to be the most important, followed by two linguistic (affective and cognitive cues) and two review-related attributes (review depth and structure). The present study contributes to the literature on fake online review detection, especially in the hospitality field and the body of knowledge on supervised ML algorithms.
- Conference Article
6
- 10.1109/spec52827.2021.9709436
- Dec 6, 2021
Three-phase induction motors (IMs) are one of the most employed electric machines in industrial and household applications. Condition monitoring of these machines is essential to avoid unplanned maintenance and thereby enhance the availability. Artificial Intelligence (AI) technologies are emerging as an advanced tool for automating condition monitoring process to detect incipient faults at early stages. Machine Learning (ML) algorithms have been identified as a promising approach for condition monitoring of IMs and predicting maintenance to avoid failures. However, selecting the suitable ML algorithm for a given application is challenging because there is no predefined set of application-based algorithms. In addition, raw data processing and feature selection need careful attention to improve the accuracy of the results. This paper reviews supervised ML algorithms that can be used for condition monitoring of IMs and identifies their benefits and drawbacks. It then discusses how the dominant features from raw data can be selected through time domain and frequency domain analysis using the acoustic data collected from a three-phase induction motor. The study investigates classification accuracy of each ML algorithm and a procedure for selecting an algorithm based on the experimental results. Results of this study show that Support Vector Machines (SVM) algorithm outperforms other competing algorithms in condition monitoring of IMs when the dominant frequency components obtained through Fast Fourier Transform (FFT) are used as training data.
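To make the idea of "dominant frequency components" concrete, here is a minimal sketch of extracting the strongest spectral peaks from a signal so they can serve as classifier features. It is not the authors' pipeline: it uses a naive O(n²) discrete Fourier transform from the standard library (an FFT computes the same spectrum faster), and the synthetic two-tone signal, sample rate, and function names are all assumptions made for illustration.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude spectrum for bins 0..n/2 using a naive DFT."""
    n = len(signal)
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(signal))
        magnitudes.append(abs(acc))
    return magnitudes


def dominant_frequencies(signal, sample_rate, top_k=2):
    """Return the top_k (frequency_hz, magnitude) pairs, strongest first."""
    mags = dft_magnitudes(signal)
    n = len(signal)
    # Skip bin 0 (the DC offset), then rank the remaining bins by magnitude.
    ranked = sorted(range(1, len(mags)), key=lambda k: mags[k], reverse=True)
    return [(k * sample_rate / n, mags[k]) for k in ranked[:top_k]]


# Hypothetical acoustic signal: a 50 Hz tone plus a weaker 120 Hz tone.
sample_rate = 256
tone_signal = [
    math.sin(2 * math.pi * 50 * i / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 120 * i / sample_rate)
    for i in range(256)
]
peak_features = dominant_frequencies(tone_signal, sample_rate, top_k=2)
print(peak_features)
```

In a setup like the one the abstract describes, each recording would be reduced to a short vector of such peaks before being passed to a classifier such as an SVM.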
- Book Chapter
5
- 10.1007/978-981-16-5847-1_10
- Oct 12, 2021
In recent years, Machine Learning (ML) algorithms have gained much attention and become important in the processing, classification, and analysis of multispectral and hyperspectral remotely sensed data. The core objectives of this chapter are, first, to provide a critical review of important advanced ML algorithms in remote sensing data classification and analysis; and second, to examine the performance of three widely used supervised ML algorithms, namely Random Forest (RF), Support Vector Machine (SVM), and Classification and Regression Tree (CART), in satellite image classification and analysis on the Google Earth Engine (GEE) platform to derive distinct Land Use/Land Cover (LULC) classes. ML algorithms are used extensively in optical remote sensing data analysis. They include image classification algorithms, which allocate objects to a distinct set of known classes; clustering algorithms, which group objects into classes based on a given set of input variables; regression algorithms, which forecast a response variable from a given set of covariates; and dimensionality reduction algorithms, which build a small set of new variables that retains most of the information available in the numerous input variables. In the study, among the three supervised ML algorithms tested in LULC classification, the CART algorithm shows relatively better performance than the RF and SVM algorithms. The study concludes that advanced ML algorithms have immense potential in optical remote sensing data classification and analysis for attaining higher classification accuracy.
- Conference Article
3
- 10.1115/imece2023-114248
- Oct 29, 2023
Surface roughness quality has implications for the functionality, assembly, service life, and appearance of the machined product. Considering the complex nature of metal cutting processes, computer simulation models may not provide the needed accuracy to predict surface conditions under all cutting conditions. Therefore, machine learning (ML) techniques can provide more reliable prediction models that are based on real-time sensory data from the cutting process. The implementation of artificial intelligence (AI) techniques in the monitoring of manufacturing processes has been gaining momentum. The focus of this study is to predict the surface roughness using acoustic emission (AE) signals during the dry end milling of stainless steel. AE sensors have been widely used to monitor the condition of structures and manufacturing processes. Furthermore, acoustic sensors are non-invasive and can be used at any location without disrupting or stopping the machining process. Features extracted from the AE signals are used as surface roughness quality indicators. These features include frequency-band averaged amplitudes, statistical quantities of the wavelet decompositions, raw signal RMS values, and crest factor. In this work, several machine learning algorithms are used to process the extracted AE features for surface roughness characterization. The full set of AE features is first processed for feature set reduction, since many of the features are highly correlated. This is done using both supervised and unsupervised feature reduction and subset selection methods. The features extracted by the supervised feature reduction methods are used to train three supervised classifiers: a k-nearest neighbor (kNN) classifier, a radial-basis function support vector machine (RBF-SVM), and a random forest (RF) classifier. The reduced feature set from the unsupervised feature reduction methods is used as input to two unsupervised clustering methods: K-Means and DBSCAN.
The classifier models are trained using a multi-fold cross-validated mix of subsets of the reduced features. In this study we have used ten models with two-fold cross-validation for training and validation of the supervised learning methods. The results of supervised classification are compared to unsupervised clustering and are reported as an average over the ten models (or ten runs with distinct initializations of the clustering algorithm), along with detailed nonparametric testing to verify the statistical significance of differences in performance between pairs of algorithms.
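The abstract above motivates feature reduction by noting that many extracted features are highly correlated. One simple unsupervised way to act on that observation is a greedy correlation filter, sketched below. This is an illustrative assumption, not the reduction method used in the study; the feature names, values, and the 0.95 threshold are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant feature has no defined correlation
    return cov / math.sqrt(var_a * var_b)


def drop_correlated(feature_table, threshold=0.95):
    """Greedy filter: keep a feature only if its absolute correlation with
    every feature already kept is below the threshold."""
    kept = []
    for name, values in feature_table.items():
        if all(abs(pearson(values, feature_table[k])) < threshold
               for k in kept):
            kept.append(name)
    return kept


# Hypothetical AE features measured on five cuts.
feature_table = {
    "rms": [1.0, 2.0, 3.0, 4.0, 5.0],
    "rms_scaled": [2.1, 4.0, 6.2, 7.9, 10.0],   # nearly a multiple of rms
    "crest_factor": [3.0, 1.0, 4.0, 1.0, 5.0],
}
kept_features = drop_correlated(feature_table)
print(kept_features)
```

The result depends on the order in which features are visited, which is one reason studies often compare several reduction methods rather than relying on a single filter.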
- Research Article
4
- 10.5812/iranjradiol-119266
- Sep 11, 2022
- Iranian Journal of Radiology
Background: Accurate differentiation of angiomyolipoma (AML) from renal cell carcinoma (RCC) is important in RCC diagnosis. Objectives: This study aimed to evaluate the performance of different supervised machine learning (ML) algorithms for RCC based on computed tomography (CT) examinations. Patients and Methods: The CT images of known cases of RCC or renal AML were collected and divided into training and testing groups. The texture features of CT images were drawn and quantified in MaZda software; a total of 352 features were drawn from each image. Top 10 features with statistical significance for differentiation of RCC from benign tumors in the training group were selected to establish diagnosis models based on 16 supervised ML algorithms. Next, the models were compared regarding accuracy and specificity. The trained models were further examined by comparison with data from the testing group. Results: Among 16 classifiers trained in this study, the logistic regression, linear discriminant analysis, k-nearest neighbor algorithm, support vector machines (SVMs), ridge classifier, AdaBoost classifier, gradient boosting classifier, and CatBoost classifier showed good performance in discriminating RCC from AML (accuracy, ≥ 0.7; area under the [receiver operating characteristic [ROC]] curve [AUC] ≥ 0.75) in both training and testing datasets. Conclusion: Based on the ML algorithms for big data, diagnostic classifiers can be valuable tools for an accurate diagnosis of RCC. By comparing different algorithms, the present results indicated potential algorithms for the development of RCC diagnostic classifiers.
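The AUC threshold quoted in the abstract above has a direct probabilistic reading: it is the chance that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes AUC from that definition. The scores and labels are hypothetical, and the pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == positive]
    neg = [s for y, s in zip(labels, scores) if y != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores: 1 = RCC, 0 = AML.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4]
auc = roc_auc(labels, scores)
print(auc)
```

Here one positive case (score 0.6) is ranked below one negative case (score 0.7), so 8 of the 9 pairs are ordered correctly.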
- Book Chapter
- 10.1007/978-3-031-00978-5_21
- Jan 1, 2022
The network functions virtualization (NFV) architecture concept is gaining popularity and is used in different systems. Together with cloudification within public, private and mixed clouds, it is becoming a base for the future development of the digital world. The concepts of containers, virtual network functions, and application functions cohere within the clouds and are guided by NFV systems. Another rapidly developing aspect is access technologies, especially 5G, which is the widely expected enabler of the IoT. In such circumstances, most of the network traffic is expected to flow in the east–west direction, never leaving the cloud. Our work is focused on preparing an experimental environment that simulates such traffic. We analyse the traffic by classifying the network data flows using a selected set of six supervised machine learning (ML) algorithms. The goal of our research is to find the algorithm with the best performance within the prepared environment. We define performance as a combination of the ML algorithm's classification precision and its time consumption, which is highly significant, especially for 5G, where any packet delay introduced within the system may compromise the latency requirements of the 5G specification. From the research we conclude that, of the six ML algorithms explored, the Decision Tree algorithm is the most suitable classifier: it meets the needed precision across all classes and also the time consumption needs. Our approach also considers the regulatory point of view for automated data analysis within systems: we deal only with statistical features of the network flows, while the payload data, the source and destination information, and the network port are excluded as classification attributes, especially as we deal with VoIP and encrypted VoIP data used in 5G.
- Research Article
36
- 10.1016/j.rsase.2021.100569
- Jun 24, 2021
- Remote Sensing Applications: Society and Environment
A simple and robust wetland classification approach by using optical indices, unsupervised and supervised machine learning algorithms
- Conference Article
- 10.3990/2.378
- Jan 1, 2016
This study examined the value of automated and manual feature selection, when applied to machine learning and object-based image analysis (OBIA), for the differentiation of crops in a Mediterranean climate. Five Landsat 8 images covering the phenological stages of seven major crop types in the study area (Cape Winelands, South Africa) were acquired and processed. A statistical image fusion technique was used to enhance the spatial resolution of the imagery. The pan-sharpened imagery was used to produce a range of spectral features, textural measures, indices and colour transformations, after which it was segmented using the multi-resolution segmentation (MRS) algorithm. The entire set of 205 features (41 per image capture date) was then subjected to different feature selection and reduction methods. The feature selection and reduction methods included manual feature removal (i.e. grouping into semantic themes), filter methods (such as classification and regression trees (CART) and random forest (RF)), and statistical principal components analysis (PCA). The experiments were carried out in two scenarios, namely 1) on all input images in combination; and 2) on each individual image date. The feature subsets were used as input to decision trees (DTs), k-nearest neighbour (k-NN), support vector machine (SVM), and random forest (RF) machine learning classifiers. To assess the value of each feature reduction method (comprising feature reduction and selection techniques), overall accuracy, the kappa coefficient and McNemar’s test were used to measure classification accuracy and compare the results. The results show that feature selection was able to improve the overall crop identification accuracy for the DT, k-NN, and RF classifiers, but was unable to do so for SVM. SVM scored the highest overall accuracy and kappa coefficient, even without applying feature reduction or selection.
Based on these results it was concluded that, although feature selection can aid the crop differentiation process, it is not a necessity.
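McNemar's test, mentioned in the abstract above, compares two classifiers evaluated on the same samples by looking only at the cases where exactly one of them is correct. The sketch below uses the continuity-corrected chi-square approximation with one degree of freedom, which is less reliable when the discordant counts are small. It is an illustration under that assumption, not the study's analysis; the predictions and the function name `mcnemar_test` are hypothetical.

```python
import math


def mcnemar_test(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar test for two classifiers.

    Returns (statistic, p_value) from the chi-square approximation
    with one degree of freedom.
    """
    # b: classifier A correct, B wrong; c: A wrong, B correct.
    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    if b + c == 0:
        return 0.0, 1.0  # the classifiers never disagree on correctness
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical predictions on 20 samples that all belong to class 1.
y_true = [1] * 20
pred_a = [1] * 18 + [0] * 2           # classifier A misses the last 2
pred_b = [1] * 10 + [0] * 8 + [1] * 2  # classifier B misses 8 in the middle
statistic, p_value = mcnemar_test(y_true, pred_a, pred_b)
print(statistic, p_value)
```

With 8 cases favouring A and 2 favouring B, the p-value stays above 0.05, so this small sample would not justify calling A significantly better even though its accuracy is higher (90 percent versus 60 percent).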
- Research Article
121
- 10.1016/j.foodchem.2021.131471
- Oct 26, 2021
- Food Chemistry
The application of machine-learning and Raman spectroscopy for the rapid detection of edible oils type and adulteration
- Research Article
- 10.22214/ijraset.2023.54487
- Jun 30, 2023
- International Journal for Research in Applied Science and Engineering Technology
Software Flaw Projection (SFP) is an important issue in the software development and maintenance process. Software flaws can cause significant problems for software development teams, so projecting software faults in an earlier phase improves software quality, reliability, and efficiency and reduces software cost. However, developing a robust flaw projection model is a challenging task, and many techniques have been proposed. Projecting the likelihood of flaws occurring in software can help developers prevent or mitigate their impact. This paper presents a software flaw projection model based on Machine Learning (ML) algorithms. Supervised ML algorithms have been used to predict future software faults based on historical data. The evaluation showed that ML algorithms can be used effectively with a high accuracy rate. Furthermore, a comparison measure is applied to compare the proposed prediction model with other approaches. The collected results showed that the ML approach has better performance.