Data Setting Research Articles

Choosing the most appropriate unsupervised machine-learning method for “heterogeneous” or “mixed” data, i.e. data with both continuous and categorical variables, can be challenging. Our aim was to examine the performance of various clustering strategies for mixed data using both simulated and real-life data. We conducted a benchmark analysis of “ready-to-use” tools in R, comparing 4 model-based (Kamila algorithm, Latent Class Analysis, Latent Class Model [LCM] and Clustering by Mixture Modeling) and 5 distance/dissimilarity-based (Gower distance or Unsupervised Extra Trees dissimilarity followed by hierarchical clustering or Partitioning Around Medoids, K-prototypes) clustering methods. Clustering performance was assessed by the Adjusted Rand Index (ARI) on 1000 generated virtual populations consisting of mixed variables, using 7 scenarios with varying population sizes, numbers of clusters, numbers of continuous and categorical variables, proportions of relevant (non-noisy) variables and degrees of variable relevance (low, mild, high). The clustering methods were then applied to data from the EPHESUS randomized clinical trial (a heart failure trial evaluating the effect of eplerenone) to illustrate the differences between clustering techniques. The simulations revealed the dominance of K-prototypes, Kamila and LCM models over all other methods. Overall, methods using dissimilarity matrices in classical algorithms such as Partitioning Around Medoids and Hierarchical Clustering had a lower ARI than model-based methods in all scenarios. When the clustering methods were applied to a real-life clinical dataset, LCM showed promising results with regard to differences in (1) clinical profiles across clusters, (2) prognostic performance (highest C-index) and (3) identification of patient subgroups with substantial treatment benefit.
The present findings suggest key differences in clustering performance between the tested algorithms (limited to tools readily available in R). In most of the tested scenarios, model-based methods (in particular the Kamila and LCM packages) and K-prototypes performed best in the setting of heterogeneous data.
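
The abstract evaluates each method by the Adjusted Rand Index between the recovered clusters and the simulated ground truth. As a minimal sketch of how that score is computed (written in plain Python for illustration, not the R tooling used in the study; the function name `adjusted_rand_index` and the toy labels are assumptions):

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two partitions of the same items.

    Hypothetical helper for illustration; not taken from the benchmarked paper.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare; treat as trivial agreement

    # Contingency table cells and its row/column margins.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of item pairs placed together in both, in truth, and in prediction.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_cells - expected) / (max_index - expected)


# Toy labels (assumed): the score ignores how clusters are named.
same_up_to_renaming = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

A score of 1 indicates identical partitions, values near 0 indicate chance-level agreement, and negative values indicate agreement below chance.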

Choice-based conjoint (CBC) is now the most widely used variant of conjoint analysis, a class of methods for measuring consumer preferences. The primary reason for the increasing dominance of the CBC approach over the last 35 years is that it closely mimics the real choice behavior of consumers by asking respondents repeatedly to choose their preferred alternative from a set of several offered alternatives (choice sets). Within the framework of CBC analysis, the multinomial logit (MNL) model is the most frequently used discrete choice model because closed-form solutions exist for its conditional choice probabilities. The popularity of CBC and the MNL model has grown even more since the introduction of hierarchical Bayesian (HB) estimation techniques, which accommodate individual consumer heterogeneity in choice data and have now become state-of-the-art in marketing theory and practice. Still, researchers and practitioners have to make further decisions under this framework (CBC, MNL, HB estimation), such as how to represent preference heterogeneity. Here, using a normal distribution (and therefore a unimodal distribution) has become the standard approach in the marketing literature. However, the thin tails of the normal distribution suggest that the standard HB-MNL model should not be the “go-to” approach to approximate multimodal preference distributions, because individual preference patterns lying at the tails of the normal distribution (i.e., patterns that do not fit well with the assumption of a unimodal distribution) tend to be shrunk to the population mean. This shrinkage, especially in multimodal data settings, could mask important information (e.g., new or different structures in the data). A mixture of normal distributions avoids the limited flexibility of the simplest continuous approach, which assumes a unimodal prior heterogeneity distribution.
There are currently two prominent HB-CBC modeling approaches embedding the mixture-of-normals (MoN) approach: the more widespread MoN-HB-MNL model, and the Dirichlet process mixture (DPM)-HB-MNL model. In this article, we review the prominent HB-MNL model (with its normal prior), the MoN-HB-MNL model, and the DPM-HB-MNL model and apply them to an empirical multi-country CBC data set. We compare the statistical performance of the three models in terms of goodness-of-fit and predictive accuracy, show how to include consumer background characteristics in the upper level of these models, and illustrate how to interpret the estimation results (with a special focus on cross-country heterogeneity). In sum, our article serves as a kind of user guide to the estimation and interpretation of Hierarchical Bayes Conjoint Choice Models. For our data, we observed that all three choice models (both with and without consumer background characteristics) resulted in a one-component solution. The DPM-HB-MNL model nevertheless yielded a higher cross-validated hit rate than the MoN-HB-MNL and the HB-MNL models due to its even more flexible prior assumptions. The two latter models tended to slightly overfit our empirical data, which was reflected by higher goodness-of-fit statistics but a lower predictive accuracy compared to the DPM-HB-MNL model. We showed that this result could be attributed to the weaker extent of Bayesian shrinkage of these two models. The DPM-HB-MNL model showed a stronger shrinkage effect and therefore seems somewhat more robust against overfitting. Including consumer background characteristics in terms of country of origin information for the respondents did not improve the statistical model performance (especially not the predictive performance). Still, using the country of origin information for respondents in a post-hoc segmentation analysis helped us to explain some differences in brand preferences between the five countries.

Articles published on Data Setting

Machine learning in pharmacometrics: Opportunities and challenges.

Distributed Bayesian Inference in Linear Mixed-Effects Models

Head-to-head comparison of clustering methods for heterogeneous data: a simulation-driven benchmark

Demosaicing by Differentiable Deep Restoration

Website Fingerprinting in the Age of QUIC

Change-Point Detection for Graphical Models in the Presence of Missing Values

Tensor Canonical Correlation Analysis With Convergence and Statistical Guarantees

False discovery rate for functional data

Functional analysis of generalized linear models under non-linear constraints with applications to identifying highly-cited papers

An efficient numerical method for condition number constrained covariance matrix approximation

Concentration bounds for temporal difference learning with linear function approximation: the case of batch data and uniform sampling

Analyzing state government spending: balanced budget rules or forward-looking decisions?

Coresets for Regressions with Panel Data

A Correlated Noise-assisted Decentralized Differentially Private Estimation Protocol, and its application to fMRI Source Separation.

Semi-parametric Estimation of Biomarker Age Trends with Endogenous Medication Use in Longitudinal Data

Unsupervised Adaptation for High-Dimensional with Limited-Sample Data Classification Using Variational Autoencoder

Hierarchical Bayes Conjoint Choice Models - Model Framework, Bayesian Inference, Model Selection, and Interpretation of Estimation Results

Compositional trend filtering

A computational workflow to explore material properties of existing settings of point cloud data

A Simulation Study for Performance Comparison between Generalized Linear Mixed Modeling (GLMM) and Generalized Regression Neural Networks (GRNN) in Joint Modeling of Mixed Responses
