Modular Local Classification via Cluster-Guided Feature Selection in Tabular Data


Similar Papers
  • Supplementary Content
  • 10.1177/20552076251384237
Prediction of new-onset atrial fibrillation in sepsis patients by machine learning: A systematic review
  • May 1, 2025
  • Digital Health
  • Shuxuan Ye + 7 more

Aims: New-onset atrial fibrillation (NOAF) occurs in approximately 23% of patients with sepsis and is independently associated with increased mortality. Therefore, early prediction of NOAF has significant clinical value. However, current artificial intelligence (AI) models predominantly rely on tabular data. These unimodal AI models face limitations in predicting NOAF as they fail to fully utilize the predictive potential arising from the interplay of multimodal data. Methods: We reviewed current Machine Learning (ML) and Deep Learning (DL) approaches for atrial fibrillation (AF) prediction. The review summarizes the selected features in ML models for predicting AF in ICU patients, and the advantages of time-window selection in DL models using electrocardiogram (ECG) signals. Notably, we compared these models in terms of feature selection, prediction horizons, and performance when applied to tabular data and ECG signal features. To enhance the predictive capability of ML for NOAF in patients with sepsis, we drew inspiration from multimodal models developed for other diseases, such as Alzheimer's disease, and proposed integrating tabular data and ECG signal data within a multimodal framework. Results: This study systematically analyzed the application of ML and DL in AF prediction. After screening, 12 studies (6 ML, 6 DL) were included. ML models, based on electronic medical records (EMR) or ECG features, achieved prediction windows ranging from minutes to hours with AUCs of 0.74–0.90. DL models processing raw ECG signals extended prediction windows to days, achieving AUCs of 0.74–0.96, with performance improving with larger datasets. A Transformer-based multimodal model (integrating clinical data and ECG) was proposed to enhance AF prediction in sepsis patients, though further validation is needed for cross-modal data fusion feasibility. Conclusions: Transitioning from unimodal predictive models to multimodal frameworks that combine tabular clinical data and raw ECG signals is feasible within the current deep-learning framework. This approach has the potential to significantly improve the early prediction capabilities of NOAF in sepsis patients.

  • Research Article
  • 10.1371/journal.pone.0339864
FKG-MM: A multi-modal fuzzy knowledge graph with data integration in healthcare
  • Jan 2, 2026
  • PLOS One
  • Nguyen Hong Tan + 5 more

Artificial Intelligence (AI) has been widely applied to healthcare in various tasks to support clinicians in disease diagnosis and prognosis. It is known that an accurate diagnosis must draw on multiple sources of evidence, such as clinical records, X-ray images, and IoT data, collectively called multi-modal data. Despite the existence of various approaches for multi-modal medical data fusion, the development of comprehensive systems capable of integrating data from multiple sources and modalities remains a considerable challenge. In addition, many machine learning models face difficulties in representation and computation due to the uncertainty and diversity of medical data. This study proposes a novel multi-modal fuzzy knowledge graph framework, called FKG-MM, which integrates multi-modal medical data from multiple sources, offering enhanced computational performance compared to unimodal data. The FKG-MM framework is based on the fuzzy knowledge graph model, one of the models that represent and compute effectively with medical data in tabular form. Through experimental scenarios using the well-known BRSET dataset on multi-modal diabetic retinopathy, it has been experimentally validated that the feature selection method, when combining image features with tabular medical data features, gives the highest reliability results among 5 methods including Feature Selection Method, Tensor Product, Hadamard Product, Filter Selection, and Wrapper Selection. The experiment also confirms that the accuracy of FKG-MM increases by 12–14% when combining image data with tabular medical data, compared with related methods that diagnose using only tabular data.

  • Research Article
  • 10.1038/s41598-025-96109-0
Advanced solar radiation prediction using combined satellite imagery and tabular data processing
  • Apr 23, 2025
  • Scientific Reports
  • Mohammed Attya + 3 more

Accurate solar radiation prediction is crucial for optimizing solar energy systems. Two types of data can be used to predict solar radiation: satellite images and tabular satellite data. This research focuses on enhancing solar radiation prediction by integrating data from two distinct sources: satellite imagery and ground-based measurements. By combining these datasets, the study improves the accuracy of solar radiation forecasts, which is crucial for renewable energy applications. This research presents a hybrid methodology to predict solar radiation from both satellite images and satellite data. The methodology is based on two datasets: the first contains tabular data, and the second contains satellite images. The framework is divided into two paths. The first path takes the satellite images as input and has three steps. The first step removes noise using a latent diffusion model. The second step performs pixel imputation using a modified RF + Identity GAN (the model has two modifications: an identity block added to address the mode collapse problem in GANs, and the use of 8-connected pixels to generate a value for a missing pixel that is close to the real missing pixel). The third step uses a self-organizing map to identify the informative parts of the satellite image. The second path takes the tabular data as input and uses a diffusion model to impute its missing values. Finally, we merge the two paths and apply feature selection to produce the input for an LSTM that predicts solar radiation. The experiments demonstrate the effectiveness of each stage (missing-pixel imputation, noise removal, missing-data imputation, and prediction using the LSTM) when compared with other available techniques. The experiments also show that all prediction models improve when the two paths are added before the prediction step.

  • Research Article
  • 10.1609/aaai.v37i8.26090
Weight Predictor Network with Feature Selection for Small Sample Tabular Biomedical Data
  • Jun 26, 2023
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Andrei Margeloiu + 3 more

Tabular biomedical data is often high-dimensional but with a very small number of samples. Although recent work showed that well-regularised simple neural networks could outperform more sophisticated architectures on tabular data, they are still prone to overfitting on tiny datasets with many potentially irrelevant features. To combat these issues, we propose Weight Predictor Network with Feature Selection (WPFS) for learning neural networks from high-dimensional and small sample data by reducing the number of learnable parameters and simultaneously performing feature selection. In addition to the classification network, WPFS uses two small auxiliary networks that together output the weights of the first layer of the classification model. We evaluate on nine real-world biomedical datasets and demonstrate that WPFS outperforms other standard as well as more recent methods typically applied to tabular data. Furthermore, we investigate the proposed feature selection mechanism and show that it improves performance while providing useful insights into the learning task.

  • Research Article
  • Citations: 11
  • 10.1371/journal.pone.0295598
A comparative analysis of converters of tabular data into image for the classification of Arboviruses using Convolutional Neural Networks
  • Dec 8, 2023
  • PLOS ONE
  • Leonides Medeiros Neto + 2 more

Tabular data is commonly used in business and literature and can be analyzed using tree-based Machine Learning (ML) algorithms to extract meaningful information. Deep Learning (DL) excels in data such as image, sound, and text, but it is less frequently utilized with tabular data. However, it is possible to use tools to convert tabular data into images for use with Convolutional Neural Networks (CNNs) which are powerful DL models for image classification. The goal of this work is to compare the performance of converters for tabular data into images, select the best one, optimize a CNN using random search, and compare it with an optimized ML algorithm, the XGBoost. Results show that even a basic CNN, with only 1 convolutional layer, can reach comparable metrics to the XGBoost, which was trained on the original tabular data and optimized with grid search and feature selection. However, further optimization of the CNN with random search did not significantly improve its performance.

  • Research Article
  • 10.32628/ijsrst2512362
Transforming Tabular Data into Image Features for Robust DDoS Attack Detection Using Deep Learning
  • May 22, 2025
  • International Journal of Scientific Research in Science and Technology
  • Verma Jyoti Sukhdev Sushila + 1 more

This paper introduces an innovative deep learning-based method for robust DDoS attack detection by transforming tabular network traffic data into color image representations. Traditional machine learning approaches often struggle to capture the complex relationships present in tabular data, limiting their effectiveness against sophisticated cyber threats. To address this, the proposed approach converts each network traffic record into a structured color image, enabling the model to learn spatial and chromatic feature correlations. Leveraging EfficientNetB0, a cutting-edge convolutional neural network architecture, the system extracts rich feature representations and performs precise classification. Key preprocessing steps, including feature selection and data augmentation, are applied to improve input quality and model generalization. Experimental results demonstrate that the proposed model achieves outstanding performance, with an accuracy of 99.0%, precision of 99.2%, recall of 98.8%, and an F1-score of 99.0%. These metrics highlight the model’s exceptional ability to accurately identify and distinguish DDoS attacks from normal traffic. By transforming tabular data into informative color images and utilizing advanced deep learning techniques, this study presents a scalable and effective framework for enhancing network security and improving detection capabilities against increasingly complex cyberattacks.
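As a quick consistency check on the metrics reported in the abstract above, F1 is the harmonic mean of precision and recall. The following is a minimal sketch, not code from the paper; the function name `f1_from_precision_recall` is our own and is used only for illustration.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (standard F1)."""
    if precision + recall == 0:
        # Convention assumed here: F1 is 0 when both inputs are 0.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract (99.2% and 98.8%).
f1 = f1_from_precision_recall(0.992, 0.988)
print(f"F1 = {f1:.3f}")
```

Rounded to three decimals this gives 0.990, which agrees with the reported F1-score of 99.0%.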

  • Conference Article
  • Citations: 6
  • 10.1145/3543507.3583382
DIWIFT: Discovering Instance-wise Influential Features for Tabular Data
  • Apr 30, 2023
  • Dugang Liu + 8 more

Tabular data is one of the most common data storage formats behind many real-world web applications such as retail, banking, and e-commerce. The success of these web applications largely depends on the ability of the employed machine learning model to accurately distinguish influential features from all the predetermined features in tabular data. Intuitively, in practical business scenarios, different instances should correspond to different sets of influential features, and the set of influential features of the same instance may vary in different scenarios. However, most existing methods focus on global feature selection assuming that all instances have the same set of influential features, and few methods considering instance-wise feature selection ignore the variability of influential features in different scenarios. In this paper, we first introduce a new perspective based on the influence function for instance-wise feature selection, and give some corresponding theoretical insights, the core of which is to use the influence function as an indicator to measure the importance of an instance-wise feature. We then propose a new solution for discovering instance-wise influential features in tabular data (DIWIFT), where a self-attention network is used as a feature selection model and the value of the corresponding influence function is used as an optimization objective to guide the model. Benefiting from the advantage of the influence function, i.e., its computation does not depend on a specific architecture and can also take into account the data distribution in different scenarios, our DIWIFT has better flexibility and robustness. Finally, we conduct extensive experiments on both synthetic and real-world datasets to validate the effectiveness of our DIWIFT.

  • Research Article
  • 10.4018/ijssci.399476
Adaptive 1-D CNN Using LSelect Feature Selection for Predicting Software Faults
  • Jan 22, 2026
  • International Journal of Software Science and Computational Intelligence
  • Tamanna Mishra + 1 more

Software Fault Prediction (SFP) is the process of predicting fault-prone software constructs during the initial phases of software development. Deep Neural Networks (DNN) have been hugely successful in the field of computer vision, audio, etc., where the input data is correlated spatially and temporally. In contrast, SFP operates on tabular data, rows of software metrics that lack the inherent spatial structure exploited by convolutional architectures in image domains. The authors have tried to remodel the most successful Convolutional Neural Network (CNN) for tabular data. A novel framework is proposed employing a tree-based feature selection technique, LSelect, to find the most significant features and an adaptive 1-dimensional Convolutional Neural Network (ACNN) for the classification task, which selects an optimal learning rate automatically. ACNN converts the tabular data (1-D) into 2-D using adaptive pooling layers, thereby forming an image from 1-D data. The framework classification results (Area under Curve) are compared with nine state-of-the-art algorithms, such as XGBoost, LightGBM, etc., and performance is validated using the Bayesian Signed Rank Test. It is found that the proposed framework performs comparably with the state-of-the-art methods with reduced model complexity. Also, the LSelect feature selection technique improves average model performance by 1.3%.

  • Research Article
  • Citations: 2
  • 10.1556/1647.2023.00109
A machine learning framework for performing binary classification on tabular biomedical data
  • Jun 26, 2023
  • Imaging
  • Ádám Szijártó + 6 more

Background and aim: Over the past decades, we have witnessed an immense expansion in the arsenal and performance of machine learning (ML) algorithms. One of the most important fields that could benefit from these advancements is biomedical science. To streamline the training and evaluation of binary classifiers, we constructed a universal and flexible ML framework that uses tabular biomedical data as input. Methods and results: Our framework requires the input data to be provided as a comma-separated values file, in which rows correspond to subjects and columns represent different features. After reading the content of this file, the framework enables the users to perform outlier detection, handle missing values, rescale features, and tackle class imbalance. Then, hyperparameter tuning, feature selection, and internal validation are performed using nested cross-validation. If an additional dataset is available, the framework also provides the option for external validation. Users may also compute SHapley Additive exPlanations values to interpret the individual predictions of the model and identify the most important features. Our ML framework was implemented in Python (version 3.9), and its source code is freely available via GitHub. In the second part of this paper, we also demonstrate the usage of the framework through a case study from the field of cardiovascular imaging. Conclusions: The proposed ML framework enables the efficient training and evaluation of binary classifiers on tabular biomedical data. We hope our framework will serve as a useful resource for both learning and research purposes and will promote further innovation.

  • Research Article
  • Citations: 9
  • 10.1109/access.2022.3164104
Multi-Class Intrusion Detection Using Two-Channel Color Mapping in IEEE 802.11 Wireless Network
  • Jan 1, 2022
  • IEEE Access
  • Muhamad Erza Aminanto + 5 more

The rise of devices interconnected through wireless networks has two-sided consequences. On one hand, it helps with many human tasks; on the other hand, the exposed wireless medium leaves vulnerable systems open to exploitation by adversaries. An Intrusion Detection System (IDS) is one method to inspect the network traffic by leveraging state-of-the-art anomaly detection techniques. Deep learning models have been utilized to distinguish benign from malicious traffic. However, projecting the tabular data into images before the image classification has been the main challenge of leveraging deep learning for IDS purposes. We propose a novel projection of tabular data into 2-coded color mapping for IDS purposes. The proposed method employs a feature selection method to ensure optimal dimensionality. We examined different numbers of attribute subsets to obtain the relationship between the attributes. Furthermore, it takes advantage of the Convolutional Neural Network (CNN) model to classify the Wi-Fi attacks. We evaluate the proposed model using the most common Wi-Fi attacks dataset, Aegean Wi-Fi Intrusion Dataset (AWID2). The proposed method achieved an F1 score of 99.73% and a false positive rate of 0.24%. This study highlights the importance of addressing the mapping procedures from tabular data into grid-based data before deep learning training and validates the effectiveness of CNN to detect multiple types of wireless network attacks.

  • Research Article
  • Citations: 7
  • 10.1016/j.knosys.2024.111523
GAEFS: Self-supervised Graph Auto-encoder enhanced Feature Selection
  • Feb 14, 2024
  • Knowledge-Based Systems
  • Jun Tan + 2 more


  • Conference Article
  • Citations: 1
  • 10.1109/bigdata55660.2022.10020842
Model-free feature selection to facilitate automatic discovery of divergent subgroups in tabular data
  • Dec 17, 2022
  • Girmaw Abebe Tadesse + 3 more

Data-centric AI encourages cleaning, evaluating, and understanding data in order to achieve trustworthy AI. Existing technologies, such as AutoML, make it easier to design and train models automatically, but there is no similar level of capability for extracting data-centric insights. Manual stratification of tabular data by a given feature of interest (e.g., gender) does not scale to higher feature dimensions, a limitation that could be addressed using automatic discovery of divergent/anomalous subgroups. Nonetheless, these automatic discovery techniques often search across potentially exponential combinations of features, which could be simplified using a preceding feature selection step. Existing feature selection techniques for tabular data often involve fitting a particular model (e.g., XGBoost) in order to select important features. However, such model-based selection is prone to model bias and spurious correlations, in addition to requiring extra resources to design, fine-tune and train a model. In this paper, we propose a model-free and sparsity-based automatic feature selection (SAFS) framework to facilitate automatic discovery of divergent subgroups. Unlike filter-based selection techniques, we exploit the sparsity of objective measures among feature values to rank and select features. We validated SAFS across two publicly available datasets (MIMIC-III and Allstate Claims) and compared it with six existing feature selection methods. SAFS achieves a reduction of the feature selection time by a factor of 81× and 104×, averaged across the existing methods in the MIMIC-III and Claims datasets, respectively. SAFS-selected features are also shown to achieve competitive detection performance, e.g., 18.3% of features selected by SAFS detected a divergent group similar to that found using all features, in the Claims dataset, with a Jaccard similarity of 0.95 but with a 16× reduction in detection time.
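The Jaccard similarity quoted in the abstract above measures the overlap between two sets as the size of their intersection divided by the size of their union. The sketch below illustrates the calculation only; the function name and the record IDs are hypothetical and do not come from the paper or its datasets.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Return |a ∩ b| / |a ∪ b| for two sets."""
    if not a and not b:
        # Convention assumed here: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical record IDs of a subgroup found using all features,
# and of the subgroup found using only a selected subset of features.
group_all_features = set(range(20))       # 20 records
group_selected_features = set(range(19))  # 19 of the same records

similarity = jaccard_similarity(group_all_features, group_selected_features)
print(f"Jaccard similarity = {similarity:.2f}")
```

In this made-up case the two subgroups share 19 of 20 records, so the similarity is 19/20 = 0.95, the same value the abstract reports for the Claims dataset.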

  • Book Chapter
  • Citations: 2
  • 10.1007/978-3-030-77967-2_51
Comparison of Efficiency, Stability and Interpretability of Feature Selection Methods for Multiclassification Task on Medical Tabular Data
  • Jan 1, 2021
  • Ksenia Balabaeva + 1 more

Feature selection is an important step of the machine learning pipeline. Certain models may select features intrinsically, without human interaction or additional algorithms. Such algorithms usually belong to the class of neural networks. Others require the help of a researcher or of feature selection algorithms. However, it is hard to know beforehand which variables contain the most relevant information and which may make it difficult for a model to learn the correct relations. For that reason, researchers have been developing feature selection algorithms. To understand which methods perform better on tabular medical data, we have conducted a set of experiments to measure accuracy and stability and to compare the interpretation capacities of different feature selection approaches. Moreover, we propose an application of Bayesian Inference to the task of feature selection that may provide a more interpretable and robust solution. We believe that high stability and interpretability are as important as classification accuracy, especially in predictive tasks in medicine.

  • Research Article
  • 10.1145/3689428
Graph Representation Learning Enhanced Semi-Supervised Feature Selection
  • Nov 12, 2024
  • ACM Transactions on Knowledge Discovery from Data
  • Jun Tan + 2 more

Feature selection is a key step in machine learning: it eliminates features that are not related to the modeling target in order to create reliable and interpretable models. By exploring the potential complex correlations among features of unlabeled data, recently introduced self-supervision-enhanced feature selection methods greatly reduce the reliance on labeled samples. However, they are generally based on the autoencoder with sample-wise self-supervision, which can hardly exploit the relations among samples. To address this limitation, this article proposes graph representation learning enhanced semi-supervised feature selection (G-FS), which performs feature selection based on the discovery and exploitation of the non-Euclidean relations among features and samples by translating unlabeled “plain” tabular data into a bipartite graph. A self-supervised edge prediction task is designed to distill rich information on the graph into low-dimensional embeddings, which remove redundant features and noise. Guided by the condensed graph representation, we propose a batch attention feature weight generation mechanism that generates more robust weights according to batch-based selection patterns rather than individual samples. The results show that G-FS achieves significant performance edges in 14 datasets compared to twelve state-of-the-art baselines, including two recent self-supervised baselines. The source code is publicly available at https://github.com/Icannotnamemyselff/G-FS_Graph_enhacned_feature_selection .

  • Research Article
  • Citations: 5
  • 10.3390/app14135826
XGBoost-Enhanced Graph Neural Networks: A New Architecture for Heterogeneous Tabular Data
  • Jul 3, 2024
  • Applied Sciences
  • Liuxi Yan + 1 more

Graph neural networks (GNNs) perform well in text analysis tasks. Their unique structure allows them to capture complex patterns and dependencies in text, making them ideal for processing natural language tasks. At the same time, XGBoost (version 1.6.2) outperforms other machine learning methods on heterogeneous tabular data. However, traditional graph neural networks mainly study isomorphic and sparse data features. Therefore, when dealing with tabular data, traditional graph neural networks encounter challenges such as data structure mismatch, feature selection, and processing difficulties. To solve these problems, we propose a novel architecture, XGNN, which combines the advantages of XGBoost and GNNs to deal with heterogeneous features and graph structures. In this paper, we use GAT for our graph neural network model. We can train XGBoost and GNN end-to-end to fit and adjust the new tree in XGBoost based on the gradient information from the GNN. Extensive experiments on node prediction and node classification tasks demonstrate that the performance of our proposed new model is significantly improved for both prediction and classification tasks and performs particularly well on heterogeneous tabular data.
