Large Number Of Features Research Articles

Biomarker discovery that exploits the feature importance of machine learning models has recently gained ground in the microbiome field, owing to its high predictive performance in several disease states. To make a concrete selection among a large number of features, recursive feature elimination (RFE) has been widely used in bioinformatics. However, machine learning-based RFE has factors that decrease the stability of feature selection. In this article, we propose methods to improve stability while sustaining performance. We used abundance matrices of the gut microbiome (283 taxa at the species level and 220 at the genus level) to classify patients with inflammatory bowel disease (IBD) against healthy controls (1,569 samples). We found that applying an already published data transformation before RFE significantly improves feature stability. Moreover, we performed an in-depth evaluation of different variants of the data transformation and identified those that improve stability most without sacrificing classification performance. To ensure a robust comparison, we evaluated stability using various similarity metrics, distances, the number of common features, and the ability to filter out noise features. We confirmed that mapping by the Bray-Curtis similarity matrix before RFE consistently improves stability while maintaining good performance. Based on the best performance across 100 bootstrapped internal test sets, the multilayer perceptron algorithm exhibited the highest performance among 8 machine learning algorithms when a large number of features (a few hundred) were considered. Conversely, when only a limited number of biomarkers were used as a trade-off between optimal performance and method generalizability, the random forest algorithm performed best. Using the optimal pipeline we developed, we identified 14 biomarkers for IBD at the species level and analyzed their roles using Shapley additive explanations. Taken together, our work not only shows how to improve biomarker discovery in the metataxonomic field without sacrificing classification performance but also provides useful insights for future comparative studies.
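The idea of mapping the data through a Bray-Curtis similarity matrix before RFE, and then checking how stable the selected features are across bootstrap resamples, can be illustrated with a minimal sketch. This is not the authors' pipeline. It assumes that the mapping multiplies the sample-by-taxon matrix by a taxon-by-taxon similarity matrix (1 minus the Bray-Curtis distance between feature columns), that stability is measured as the mean pairwise Jaccard index, and that a small random forest is the RFE estimator. The function names, toy data, and all parameter values are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE


def bray_curtis_mapping(X):
    """Map samples through a taxon-by-taxon Bray-Curtis similarity matrix.

    Assumption: similarity = 1 - Bray-Curtis distance between feature columns.
    Columns must not be all zero, otherwise the distance is undefined.
    """
    D = squareform(pdist(X.T, metric="braycurtis"))  # distances between taxa
    S = 1.0 - D                                      # similarities, diagonal = 1
    return X @ S


def select_features(X, y, n_keep, seed):
    """Run RFE with a small random forest and return the kept column indices."""
    estimator = RandomForestClassifier(n_estimators=20, random_state=seed)
    rfe = RFE(estimator, n_features_to_select=n_keep, step=0.2)
    rfe.fit(X, y)
    return set(np.flatnonzero(rfe.support_).tolist())


def jaccard(a, b):
    """Jaccard index of two feature sets."""
    return len(a & b) / len(a | b)


def bootstrap_stability(X, y, n_keep=5, n_boot=4, seed=0):
    """Mean pairwise Jaccard index of the feature sets chosen on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    subsets = []
    for b in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))  # sample rows with replacement
        subsets.append(select_features(X[idx], y[idx], n_keep, seed=b))
    scores = [jaccard(subsets[i], subsets[j])
              for i in range(n_boot) for j in range(i + 1, n_boot)]
    return float(np.mean(scores))


def make_toy_data(n_samples=60, n_features=20, seed=0):
    """Hypothetical abundance-like data in which only the first two columns matter."""
    rng = np.random.default_rng(seed)
    X = rng.random((n_samples, n_features))
    y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
    return X, y
```

A comparison would call `bootstrap_stability(X, y)` on the raw matrix and `bootstrap_stability(bray_curtis_mapping(X), y)` on the mapped one. In a real evaluation the similarity matrix would more safely be computed on the training portion only, which this sketch does not do.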

Context. Cluster analysis is widely used to analyze data of various nature and dimensions. However, known cluster analysis methods are slow and memory-intensive because they must calculate pairwise distances between instances in a multidimensional feature space. In addition, their results are difficult for humans to perceive and analyze when the number of features is large. Objective. The purpose of the work is to increase the speed of cluster analysis and the interpretability of the resulting partition into clusters, and to reduce the memory requirements of cluster analysis. Method. A method for cluster analysis of multidimensional data is proposed. For each instance, it calculates a hash based on the distance to the conditional center of coordinates and uses the one-dimensional coordinate along the hash axis to determine the distances between instances. It treats the resulting hash as a pseudo-output feature and breaks it into intervals that are matched to pseudo-class labels (clusters), which yields a rough crisp partition of the feature space and of the sample instances. It then automatically generates a partition of the input features into fuzzy terms and determines the rules for assigning instances to clusters. The result is a fuzzy inference system of the Mamdani-Zadeh classifier type, which is further trained in the form of a neuro-fuzzy network to ensure acceptable values of the clustering quality functional. This makes it possible to reduce the number of terms and features used, to evaluate their contribution to decisions about assigning instances to clusters, to increase the speed of cluster analysis, and to increase the interpretability of the resulting partition into clusters. Results. Mathematical support for solving the problem of cluster analysis of high-dimensional data has been developed. Experiments confirming the operability of the developed mathematical support have been carried out. Conclusions. The developed method and its software implementation can be recommended for practical use in problems of analyzing data of various nature and dimensions.
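The first stage of the method described above (hashing each instance by its distance to a reference point, then cutting the one-dimensional hash axis into intervals) can be sketched as follows. This covers only the rough crisp partition, not the fuzzy term generation or the neuro-fuzzy training. It assumes that the "conditional center of coordinates" is the per-feature minimum, that the distance is Euclidean, and that the intervals have equal width. None of these choices is stated in the abstract, and the function names are hypothetical.

```python
import numpy as np


def distance_hash(X, center=None):
    """Return a one-dimensional hash per instance: its distance to a reference point.

    Assumption: the reference point defaults to the per-feature minimum.
    """
    X = np.asarray(X, dtype=float)
    if center is None:
        center = X.min(axis=0)
    return np.linalg.norm(X - center, axis=1)


def interval_clusters(h, n_intervals):
    """Split the hash axis into equal-width intervals and label each instance.

    Labels run from 0 to n_intervals - 1.
    """
    edges = np.linspace(h.min(), h.max(), n_intervals + 1)
    # Only the interior edges act as cut points, so the maximum stays in the last bin.
    return np.digitize(h, edges[1:-1])


# Toy example: two well-separated groups of points.
X = np.array([[0.0, 0.0], [0.1, 0.2], [10.0, 10.0], [9.8, 10.1]])
h = distance_hash(X)
labels = interval_clusters(h, n_intervals=2)
```

Because each instance is reduced to a single number, this stage needs memory proportional to the number of instances rather than to the number of instance pairs. The cost is that points at the same distance from the reference point but in different directions receive the same hash.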

Articles published on Large Number Of Features

Machine learning-based feature selection to search stable microbial biomarkers: application to inflammatory bowel disease.

DATA CLUSTERING BASED ON INDUCTIVE LEARNING OF NEURO-FUZZY NETWORK WITH DISTANCE HASHING

Optimize temporal configuration for motor imagery-based multiclass performance and its relationship with subject-specific frequency

Bayesian bi-clustering methods with applications in computational biology

Hybrid Sequential Feature Selection with Ensemble Boosting Class-based Classification Method

Ensembles of Random SHAPs

Evaluation of Boruta algorithm in DDoS detection

Graph-Based Multi-Label Classification for WiFi Network Traffic Analysis

A malware detection system using a hybrid approach of multi-heads attention-based control flow traces and image visualization

Automatic Prediction of T2/T3 Staging of Rectal Cancer Based on Radiomics and Machine Learning

Detection of Fall Risk in Multiple Sclerosis by Gait Analysis-An Innovative Approach Using Feature Selection Ensemble and Machine Learning Algorithms.

Sparse Data Reconstruction, Missing Value and Multiple Imputation through Matrix Factorization

Distributed Fuzzy Cognitive Maps for Feature Selection in Big Data Classification

A Modified Firefly Deep Ensemble for Microarray Data Classification

Used Car Price Prediction Based on the Iterative Framework of XGBoost+LightGBM

SCNIC: Sparse correlation network investigation for compositional data.

Research on imbalance machine learning methods for MRT_1WI soft tissue sarcoma data

A novel SSD fault detection method using GRU-based Sparse Auto-Encoder for dimensionality reduction

Feature Extraction and Recognition of Human Physiological Signals Based on the Convolutional Neural Network

A novel Ontology-guided Attribute Partitioning ensemble learning model for early prediction of cognitive deficits using quantitative Structural MRI in very preterm infants
