High-dimensional Biological Datasets Research Articles

Microarray technology, as applied to the fields of bioinformatics, biotechnology, and bioengineering, has made remarkable progress in both the treatment and prediction of many biological problems. However, this technology presents a critical challenge due to the size of the numerous genes present in the high-dimensional biological datasets associated with an experiment, which leads to a curse of dimensionality on biological data. Such high dimensionality of real biological data sets not only increases memory requirements and training costs, but also reduces the ability of learning algorithms to generalise. Consequently, multiple feature selection (FS) methods have been proposed by researchers to choose the most significant and precise subset of classified genes from gene expression datasets while maintaining high classification accuracy. In this research work, a novel binary method called iBABC-CGO based on the island model of the artificial bee colony algorithm, combined with the chaos game optimization algorithm and SVM classifier, is suggested for FS problems using gene expression data. Due to the binary nature of FS problems, two distinct transfer functions are employed for converting the continuous search space into a binary one, thus improving the efficiency of the exploration and exploitation phases. The suggested strategy is tested on a variety of biological datasets with different scales and compared to popular metaheuristic-based, filter-based, and hybrid FS methods. Experimental results supplemented with the statistical measures, box plots, Wilcoxon tests, Friedman tests, and radar plots demonstrate that compared to prior methods, the proposed iBABC-CGO exhibit competitive performance in terms of classification accuracy, selection of the most relevant subset of genes, data variability, and convergence rate. The suggested method is also proven to identify unique sets of informative, relevant genes successfully with the highest overall average accuracy in 15 tested biological datasets. Additionally, the biological interpretations of the selected genes by the proposed method are also provided in our research work.

Read full abstract

BackgroundA key question when analyzing high throughput data is whether the information provided by the measured biological entities (gene, metabolite expression for example) is related to the experimental conditions, or, rather, to some interfering signals, such as experimental bias or artefacts. Visualization tools are therefore useful to better understand the underlying structure of the data in a 'blind' (unsupervised) way. A well-established technique to do so is Principal Component Analysis (PCA). PCA is particularly powerful if the biological question is related to the highest variance. Independent Component Analysis (ICA) has been proposed as an alternative to PCA as it optimizes an independence condition to give more meaningful components. However, neither PCA nor ICA can overcome both the high dimensionality and noisy characteristics of biological data.ResultsWe propose Independent Principal Component Analysis (IPCA) that combines the advantages of both PCA and ICA. It uses ICA as a denoising process of the loading vectors produced by PCA to better highlight the important biological entities and reveal insightful patterns in the data. The result is a better clustering of the biological samples on graphical representations. In addition, a sparse version is proposed that performs an internal variable selection to identify biologically relevant features (sIPCA).ConclusionsOn simulation studies and real data sets, we showed that IPCA offers a better visualization of the data than ICA and with a smaller number of components than PCA. Furthermore, a preliminary investigation of the list of genes selected with sIPCA demonstrate that the approach is well able to highlight relevant genes in the data with respect to the biological experiment.IPCA and sIPCA are both implemented in the R package mixomics dedicated to the analysis and exploration of high dimensional biological data sets, and on mixomics' web-interface.

Read full abstract

High-dimensional Biological Datasets Research Articles

Articles published on High-dimensional Biological Datasets

Leveraging permutation testing to assess confidence in positive-unlabeled learning applied to high-dimensional biological datasets

Gene selection for high dimensional biological datasets using hybrid island binary artificial bee colony with chaos game optimization

Variable selection for nonlinear dimensionality reduction of biological datasets through bootstrapping of correlation networks

Efficient high-dimension feature selection based on enhanced equilibrium optimizer

Stochastic Mutual Information Gradient Estimation for Dimensionality Reduction Networks.

Distributed load balancing frequent colossal closed itemset mining algorithm for high dimensional dataset

PyBDA: a command line tool for automated analysis of big biological data sets

Penalized estimation of sparse concentration matrices based on prior knowledge with applications to placenta elemental data

New tools for the visualization of biological pathways.

A method for learning a sparse classifier in the presence of missing data for high-dimensional biological datasets.

Finding causative genes from high-dimensional data: an appraisal of statistical and machine learning approaches.

DATA REDUCTION TECHNIQUES FOR HIGH DIMENSIONAL BIOLOGICAL DATA

一种融合T-Rank和Softmax的特征提取算法研究

Independent Principal Component Analysis for biologically meaningful dimension reduction of large biological data sets

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

High-dimensional Biological Datasets Research Articles

Articles published on High-dimensional Biological Datasets

Leveraging permutation testing to assess confidence in positive-unlabeled learning applied to high-dimensional biological datasets

Gene selection for high dimensional biological datasets using hybrid island binary artificial bee colony with chaos game optimization

Variable selection for nonlinear dimensionality reduction of biological datasets through bootstrapping of correlation networks

Efficient high-dimension feature selection based on enhanced equilibrium optimizer

Stochastic Mutual Information Gradient Estimation for Dimensionality Reduction Networks.

Distributed load balancing frequent colossal closed itemset mining algorithm for high dimensional dataset

PyBDA: a command line tool for automated analysis of big biological data sets

Penalized estimation of sparse concentration matrices based on prior knowledge with applications to placenta elemental data

New tools for the visualization of biological pathways.

A method for learning a sparse classifier in the presence of missing data for high-dimensional biological datasets.

Finding causative genes from high-dimensional data: an appraisal of statistical and machine learning approaches.

DATA REDUCTION TECHNIQUES FOR HIGH DIMENSIONAL BIOLOGICAL DATA

一种融合T-Rank和Softmax的特征提取算法研究

Independent Principal Component Analysis for biologically meaningful dimension reduction of large biological data sets