High-dimensional Binary Data Research Articles

Data and knowledge management systems employ feature selection algorithms for removing irrelevant, redundant, and noisy information from the data. There are two well-known approaches to feature selection: feature ranking (FR) and feature subset selection (FSS). In this paper, we propose a new FR algorithm, termed class-dependent density-based feature elimination (CDFE), for binary data sets. Our theoretical analysis shows that CDFE computes the weights used for feature ranking more efficiently than the mutual information measure. In effect, the rankings obtained from the two criteria approximate each other. CDFE uses a filtrapper approach to select a final subset. For data sets having hundreds of thousands of features, feature selection with FR algorithms is simple and computationally efficient, but redundant information may not be removed. On the other hand, FSS algorithms analyze the data for redundancies but may become computationally impractical on high-dimensional data sets. We address these problems by combining FR and FSS methods in the form of a two-stage feature selection algorithm. When introduced as a preprocessing step to the FSS algorithms, CDFE not only presents them with a feature subset that is good in terms of classification, but also relieves them of heavy computations. Two FSS algorithms are employed in the second stage to test the two-stage feature selection idea. We carry out experiments with two different classifiers (naive Bayes and kernel ridge regression) on three different real-life data sets (NOVA, HIVA, and GINA) of the "Agnostic Learning versus Prior Knowledge" challenge. As a stand-alone method, CDFE shows up to about 92 percent reduction in the feature set size. When combined with the FSS algorithms in two stages, CDFE significantly improves their classification accuracy and exhibits up to 97 percent reduction in the feature set size. We also compared CDFE against the winning entries of the challenge and found that it outperforms the best results on NOVA and HIVA while obtaining third position on GINA.
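
To make the feature ranking idea concrete, the sketch below ranks binary features by their mutual information with the class label, which is the baseline criterion the abstract compares against. It is not CDFE itself: the abstract does not give the CDFE weight formula, so none is attempted here. The function names and the toy data are assumptions made for illustration only.

```python
from collections import Counter
from math import log2


def mutual_information(feature, labels):
    """Mutual information (in bits) between one binary feature and the labels."""
    n = len(labels)
    joint = Counter(zip(feature, labels))  # counts of (x, y) pairs
    px = Counter(feature)                  # marginal counts of x
    py = Counter(labels)                   # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


def rank_features(rows, labels):
    """Return (feature indices sorted by decreasing score, per-feature scores)."""
    n_features = len(rows[0])
    scores = [
        mutual_information([row[j] for row in rows], labels)
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Hypothetical toy data: 4 samples, 3 binary features.
rows = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
]
labels = [1, 1, 0, 0]

order, scores = rank_features(rows, labels)
print(order, scores)
```

In this toy example the first feature copies the label and the other two are independent of it, so the first feature is ranked at the top. A ranking like this says nothing about redundancy between features, which is the gap the second (FSS) stage described in the abstract is meant to close.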


Background: Graphical models were identified as a promising new approach to modeling high-dimensional clinical data. They provided a probabilistic tool to display, analyze and visualize the net-like dependence structures by drawing a graph describing the conditional dependencies between the variables. Until now, the main focus of research was on building Gaussian graphical models for continuous multivariate data following a multivariate normal distribution. Satisfactory solutions for binary data were missing. We adapted the method of Meinshausen and Bühlmann to binary data and used the LASSO for logistic regression. The objective of this paper was to examine the performance of the Bolasso in developing graphical models for high-dimensional binary data. We hypothesized that the performance of the Bolasso is superior to that of competing LASSO methods in identifying graphical models.

Methods: We analyzed the Bolasso for deriving graphical models in comparison with other LASSO-based methods. Model performance was assessed in a simulation study with random data generated via symmetric local logistic regression models and Gibbs sampling. The main outcome variables were the Structural Hamming Distance and the Youden Index. We applied the results of the simulation study to a real-life data set of functioning data from patients with head and neck cancer.

Results: Bootstrap aggregating, as incorporated in the Bolasso algorithm, greatly improved performance at larger sample sizes. The number of bootstraps had minimal impact on performance. The Bolasso performed reasonably well with a cutpoint of 0.90 and a small penalty term. Optimal prediction for the Bolasso leads to very conservative models in comparison with AIC, BIC or cross-validated optimal penalty terms.

Conclusions: Bootstrap aggregating may improve variable selection if the underlying selection process is not too unstable due to small sample size and if one is mainly interested in reducing the false discovery rate. We propose using the Bolasso for graphical modeling in large sample sizes.
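
As a rough illustration of the quantities named in this abstract, the sketch below shows (a) a bootstrap vote that keeps an edge only if it was selected in at least a cutpoint fraction of the bootstrap fits, and (b) the Structural Hamming Distance and Youden Index for undirected graphs. It assumes the per-bootstrap edge lists have already been produced by some LASSO logistic regression fit, which is not shown. All function names and the toy edge lists are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def edge_set(edges):
    """Normalize undirected edges so that (a, b) and (b, a) compare equal."""
    return {frozenset(e) for e in edges}


def aggregate_bootstrap_edges(bootstrap_edge_lists, cutpoint=0.90):
    """Keep edges selected in at least `cutpoint` of the bootstrap fits."""
    counts = Counter()
    for edges in bootstrap_edge_lists:
        counts.update(edge_set(edges))
    n_fits = len(bootstrap_edge_lists)
    return {e for e, c in counts.items() if c / n_fits >= cutpoint}


def structural_hamming_distance(true_edges, est_edges):
    """Number of edges present in exactly one of the two undirected graphs."""
    return len(edge_set(true_edges) ^ edge_set(est_edges))


def youden_index(true_edges, est_edges, n_nodes):
    """Sensitivity + specificity - 1, computed over all possible node pairs."""
    t, e = edge_set(true_edges), edge_set(est_edges)
    all_pairs = {frozenset(p) for p in combinations(range(n_nodes), 2)}
    tp, fn, fp = len(t & e), len(t - e), len(e - t)
    tn = len(all_pairs - t - e)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity + specificity - 1


# Hypothetical example: 10 bootstrap fits on a 4-node graph.
fits = (
    [[(0, 1), (1, 2), (0, 3)]] * 5   # edge (0, 3) appears in 5 of 10 fits
    + [[(0, 1), (1, 2)]] * 4         # edge (1, 2) appears in 9 of 10 fits
    + [[(0, 1)]] * 1                 # edge (0, 1) appears in all 10 fits
)
true_edges = [(0, 1), (1, 2), (2, 3)]

selected = aggregate_bootstrap_edges(fits, cutpoint=0.90)
shd = structural_hamming_distance(true_edges, selected)
j = youden_index(true_edges, selected, n_nodes=4)
print(sorted(tuple(sorted(e)) for e in selected), shd, j)
```

With a cutpoint of 0.90 the vote drops the edge that appeared in only half of the fits, so it produces no false positives in this toy case but still misses one true edge. That is in line with the abstract's remark that the approach yields conservative models aimed at reducing false discoveries.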


Articles published on High-dimensional Binary Data

Feature selection via robust weighted score for high dimensional binary class-imbalanced gene expression data

DisTop: Discovering a Topological Representation to Learn Diverse and Rewarding Skills

Online monitoring of high-dimensional binary data streams with application to extreme weather surveillance

A Set of Efficient Methods to Generate High-Dimensional Binary Data With Specified Correlation Structures

Efficient mixture model for clustering of sparse high dimensional binary data

Principal component analysis of binary genomics data.

A family of block-wise one-factor distributions for modeling high-dimensional binary data

Two-sample tests for sparse high-dimensional binary data

Comparing Measures of Association in 2×2 Probability Tables

Calibrating an Ice Sheet Model Using High-Dimensional Binary Spatial Data

High Dimensional Logistic Regression Model using Adjusted Elastic Net Penalty

Feature extraction for proteomics imaging mass spectrometry data

Applying Penalized Binary Logistic Regression with Correlation Based Elastic Net for Variables Selection

Model based clustering of high-dimensional binary data

Bit-Table Based Biclustering and Frequent Closed Itemset Mining in High-Dimensional Binary Data

Two Expectation-Maximization algorithms for Boolean Factor Analysis

Supervised Bayesian latent class models for high‐dimensional data

Feature Selection Based on Class-Dependent Densities for High-Dimensional Binary Data

Graphical modeling of binary data using the LASSO: a simulation study

Estimating and testing conditional sums of means in high dimensional multivariate binary data
