High Dimensional Gene Expression Datasets Research Articles

INTRODUCTION: Gene expression data analysis is a critical aspect of disease prediction and classification, playing a pivotal role in the field of bioinformatics and biomedical research. High-dimensional gene expression datasets hold a wealth of information, but their effective utilization is hindered by the presence of irrelevant dimensions and noise. The challenge lies in extracting meaningful features from these datasets to enhance the accuracy of disease prediction and classification while maintaining computational efficiency. Feature selection is a crucial step in addressing these challenges, as it aims to identify and retain only the most informative characteristics from large high-dimensional microarray datasets. In the context of microarray gene expression data, characterized by its substantial dimensionality, selecting relevant features is essential for efficient nearest neighbor search, a fundamental component of various analytical tasks in bioinformatics and data mining. Existing feature selection methods in high-dimensional data often face issues related to the trade-off between search accuracy and computational efficiency. This paper introduces a novel approach, the Nearest Neighbor Feature Selection with Symmetrical Uncertainty-based Redundancy Removal (NNFSRR) method, designed to enhance the classification of microarray gene expression data through feature selection. The NNFSRR method focuses on reducing the dimensionality of the dataset by identifying and removing redundant features, allowing subsequent searches to operate solely on relevant dimensions.

OBJECTIVES: The primary goal is to evaluate the NNFSRR method's effectiveness in improving nearest neighbor search in microarray gene expression datasets by reducing dimensionality. This method utilizes Symmetrical Uncertainty-based correlation between dimensions for feature selection and aims to enhance accuracy and efficiency compared to existing methods.

METHODS: The NNFSRR method uses Symmetrical Uncertainty to identify and remove redundant features from microarray gene expression datasets. Reduced datasets are used for nearest neighbor search, improving accuracy and efficiency. Experiments are conducted using real-world datasets, and comparisons with existing methods are made based on search time and accuracy.

RESULTS: The NNFSRR method demonstrates improved nearest neighbor search performance, outperforming basic brute force methods and existing feature selection techniques. Selected feature sets exhibit strong class associations while minimizing feature correlations, enhancing classification precision.

CONCLUSION: In conclusion, the NNFSRR method presents a promising approach to address the challenges posed by high-dimensional gene expression data. It effectively reduces dimensionality, improves search accuracy, and enhances the efficiency of nearest neighbor search. Our experimental results demonstrate that this method outperforms existing techniques in terms of search time and accuracy, making it a valuable tool for applications in bioinformatics, data mining, pattern recognition, and biological information retrieval. The NNFSRR method holds the potential to advance our understanding of complex biological processes and support more accurate disease prediction and classification.
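
The idea described in the abstract above (drop dimensions that are redundant under Symmetrical Uncertainty, then run nearest neighbor search on the remaining ones) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the greedy keep-or-drop order, the 0.8 threshold, the helper names, and the toy data are all assumptions made for illustration, and the features are assumed to be already discretized, since Symmetrical Uncertainty, SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), is defined on discrete variables.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables constant; treated as uncorrelated here
    mutual_info = hx + hy - entropy(list(zip(x, y)))
    return 2.0 * mutual_info / (hx + hy)


def remove_redundant(columns, threshold=0.8):
    """Greedy filter (an assumption, not the paper's procedure): keep a
    column only if its SU with every column kept so far is below threshold."""
    kept = []
    for j, col in enumerate(columns):
        if all(symmetrical_uncertainty(col, columns[k]) < threshold for k in kept):
            kept.append(j)
    return kept


def nearest_neighbor(rows, query, dims):
    """Brute-force 1-NN with squared Euclidean distance over `dims` only."""
    def dist(row):
        return sum((row[d] - query[d]) ** 2 for d in dims)
    return min(range(len(rows)), key=lambda i: dist(rows[i]))


# Hypothetical discretized expression matrix: 6 samples x 4 genes.
rows = [
    [0, 0, 2, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 2],
    [1, 1, 2, 0],
    [2, 2, 1, 1],
    [2, 2, 0, 2],
]
columns = [list(col) for col in zip(*rows)]
kept = remove_redundant(columns)  # gene 1 is an exact copy of gene 0
nearest = nearest_neighbor(rows, [2, 2, 1, 0], kept)
print(kept, nearest)
```

In this toy matrix the second gene duplicates the first, so its SU with the first gene is 1 and it is dropped; the search then compares samples on the three remaining genes only. Real microarray values are continuous and would need a discretization step first, which is omitted here.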

In the era of Big Data, cluster analysis of high-dimensional data sets often suffers from the curse of dimensionality. To overcome this problem, dimensionality reduction through feature selection becomes inevitable. Co-clustering, or two-way clustering, is considered to be a more sophisticated tool than conventional one-way clustering. Moreover, the advent of multi-view learning shows that the subjects of a data set can be interpreted in many ways. Interestingly, few existing feature selection algorithms take advantage of the co-clustering method and are designed to consider multi-view data. Motivated by this, in the current article, we propose a feature (gene) selection method for high-dimensional gene expression (GE) data through a multi-objective optimization based multi-view co-clustering algorithm (named MMCo-Clus). A popular evolutionary technique, Non-dominated Sorting Genetic Algorithm-II (NSGA-II), has been utilized as the proposed method's underlying optimization strategy. First, we construct two views of a chosen data set, utilizing knowledge from two different biological data sources. Next, we develop the MMCo-Clus algorithm considering the constructed views to identify a set of “good” co-clustering solutions. Finally, based on a consensus operation on the co-clustering outcome, a small number of the most relevant and non-redundant features are extracted from the original feature space. The reduced feature space lowers the computational burden and the noise level of the original data. For experimental analysis, we have chosen three benchmark GE data sets. Our feature selection method's effectiveness is evaluated through sample-classification accuracy, accompanied by the cluster profile plot/Eisen plot/t-SNE plot, and biological/statistical significance tests. A thorough comparative analysis with existing feature selection algorithms using external and internal evaluation metrics supports our proposed method's potency.
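
NSGA-II, named above as the underlying optimizer, ranks candidate solutions by Pareto dominance. The following Python sketch shows only that ranking step (non-dominated sorting), assuming every objective is minimized. The objective values are made up for illustration; the abstract does not state which objectives MMCo-Clus optimizes, and crowding distance, crossover, and mutation are left out.

```python
def dominates(a, b):
    """True if a Pareto-dominates b: no worse on every objective and
    strictly better on at least one (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated_sort(objectives):
    """Split solutions into Pareto fronts; returns lists of indices."""
    n = len(objectives)
    dominated = [[] for _ in range(n)]  # dominated[i]: solutions that i dominates
    count = [0] * n                     # count[i]: how many solutions dominate i
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if dominates(objectives[i], objectives[j]):
                dominated[i].append(j)
            elif dominates(objectives[j], objectives[i]):
                count[i] += 1
    fronts = [[i for i in range(n) if count[i] == 0]]
    while fronts[-1]:
        next_front = []
        for i in fronts[-1]:
            for j in dominated[i]:
                count[j] -= 1
                if count[j] == 0:  # every dominator of j is already ranked
                    next_front.append(j)
        fronts.append(next_front)
    return fronts[:-1]  # drop the trailing empty front


# Hypothetical two-objective scores for five candidate co-clustering solutions.
objectives = [(1, 5), (2, 3), (4, 1), (3, 4), (5, 5)]
fronts = non_dominated_sort(objectives)
print(fronts)
```

The first front holds the solutions that no other solution dominates. In a method like the one described, a set of "good" solutions would typically be drawn from that front, though how MMCo-Clus selects them and applies its consensus operation is not specified in the abstract.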

Articles published on High Dimensional Gene Expression Datasets

NNFSRR: Nearest Neighbor Feature Selection and Redundancy Removal Method for Nearest Neighbor Search in Microarray Gene Expression Data

Gene selection with Game Shapley Harris hawks optimizer for cancer classification

Feature selection for high dimensional microarray gene expression data via weighted signal to noise ratio.

An entropy-based density peak clustering for numerical gene expression datasets

Efficient Selection of Gaussian Kernel SVM Parameters for Imbalanced Data

A multi-objective evolutionary algorithm with decomposition and the information feedback for high-dimensional medical data

A high-dimensional feature selection method based on modified Gray Wolf Optimization

A developed ant colony algorithm for cancer molecular subtype classification to reveal the predictive biomarker in the renal cell carcinoma

HighMLR: An open-source package for R with machine learning for feature selection in high dimensional cancer clinical genome time to event data

An ensemble machine learning model based on multiple filtering and supervised attribute clustering algorithm for classifying cancer samples.

Coupling sparse Cox models with clustering of longitudinal transcriptomics data for trauma prognosis

MMCo-Clus – An Evolutionary Co-clustering Algorithm for Gene Selection

FS-GBDT: identification multicancer-risk module via a feature selection algorithm by integrating Fisher score and GBDT.

Weighted L1-norm Logistic Regression for Gene Selection of Microarray Gene Expression Classification

Integrative biomarker detection on high-dimensional gene expression data sets: a survey on prior knowledge approaches.

Sparse feature selection: Relevance, redundancy and locality structure preserving guided by pairwise constraints

Pathways Enrichment Analysis of Gene Expression Data in Type 2 Diabetes.

Gaussian mixture copulas for high-dimensional clustering and dependency-based subtyping

Visualisation and Modelling of High-Dimensional Cancerous Gene Expression Dataset

Fuzzy magnetic optimization clustering algorithm with its application to health care
