Abstract

The results of clustering are often affected by covariates that are independent of the clusters one would like to discover. Traditionally, alternative clustering algorithms can be used to solve such clustering problems. However, these suffer from at least one of the following problems: (1) continuous covariates or nonlinearly separable clusters cannot be handled; (2) assumptions are made about the distribution of the data; (3) one or more hyperparameters need to be set. The presence of covariates also affects other learning problems, such as semi-supervised learning. To the best of our knowledge, there is no existing method addressing the semi-supervised learning setting in the presence of covariates. Here we propose two novel algorithms, named kernel conditional clustering (KCC) and kernel conditional semi-supervised learning (KCSSL), whose objectives are derived from a kernel-based conditional dependence measure. KCC is parameter-light and makes no assumptions about the cluster structure, the covariates, or the distribution of the data, while KCSSL is fully parameter-free. On both simulated and real-world datasets, the proposed KCC and KCSSL algorithms perform better than state-of-the-art methods. The former detects the ground truth cluster structures more accurately, and the latter makes more accurate predictions.

Highlights

  • In many applications, labeling samples by domain experts is extremely expensive, e.g., diagnoses in the biomedical domain

  • We propose kernel conditional clustering (KCC) and kernel conditional semi-supervised learning (KCSSL), two de novo algorithms that use an extension of the Hilbert–Schmidt Independence Criterion (HSIC), known as the Hilbert–Schmidt Conditional Independence Criterion (HSCONIC) [14,15], to solve the conditional clustering problem and the conditional semi-supervised learning problem, respectively

  • Among the comparison partners for conditional clustering, many methods work for the Gaussian case (Simu1), including OC, RPCA, KDAC, and KCC

Introduction

In many applications, labeling samples by domain experts is extremely expensive, e.g., diagnoses in the biomedical domain. A fundamental problem intrinsic to both clustering and semi-supervised learning is that the relationships one expects to uncover are often driven by the presence and effect of covariates associated with the data. The structure these covariates impose on the data is often trivial to find and irrelevant to the interesting structure or relationship one hopes to discover. We focus on the problems of conditional clustering and conditional semi-supervised learning, whose aim is to maximize the dependence between the data and the clustering/label assignment, conditioned on known covariates.
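The dependence measure underlying both objectives is kernel-based: HSIC measures dependence between two variables via centered kernel matrices, and the conditional variant (HSCONIC) extends the same machinery to condition on covariates. As a minimal illustration of the unconditional building block, the sketch below implements the standard biased empirical HSIC estimator, trace(KHLH)/(n−1)², with RBF kernels; the bandwidth choice and sample sizes here are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    # Gaussian (RBF) kernel matrix from pairwise squared Euclidean distances.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    # Biased empirical HSIC estimator: trace(K H L H) / (n - 1)^2,
    # where H = I - (1/n) 11^T centers the kernel matrices.
    n = X.shape[0]
    K = rbf_kernel(X, sigma)
    L = rbf_kernel(Y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
Y_dep = X + 0.1 * rng.normal(size=(200, 1))  # strongly dependent on X
Y_ind = rng.normal(size=(200, 1))            # drawn independently of X
# A larger HSIC value indicates stronger statistical dependence.
print(hsic(X, Y_dep) > hsic(X, Y_ind))
```

Conditional clustering replaces this raw dependence score with its covariate-conditioned counterpart, so that structure already explained by the covariates does not drive the clustering objective.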
