Semi-Supervised Clustering Models for Clinical Risk Assessment

Yongyang Huo, Paul Mccullagh, Francisco Azuaje, Roy Harper

doi:10.1109/bibe.2006.253341

Abstract

Clustering methods aim to organize a collection of cases into groups such that cases within one cluster are more similar to each other than to those in other clusters. A small amount of background knowledge may also be used to guide the clustering process and aid in the interpretation of results. This type of knowledge-driven clustering is known as semi-supervised clustering. The knowledge may be represented by pairwise constraints, labelled cases or known data groupings. Pairwise constraints may be specified, for example, as "MustLink" or "CannotLink" associations between cases. This research proposes a semi-supervised clustering method that exploits pairwise constraints and similarity information extracted from constrained cases. The algorithm was first evaluated on publicly available biomedical datasets. It was then applied to a Type II diabetes dataset to assess coronary heart disease (CHD) complication. This dataset comprises laboratory and physiological information from diabetic patients at the Ulster Hospital (UH) in Northern Ireland. The following methods were compared: traditional k-means, constraint-based k-means with pairwise constraints (CK method) and similarity-driven constraint-based k-means (SCK method). Results showed that the predictive quality on these datasets, i.e. the detection of relevant partitions and significant clusters, improved with a small amount of supervision (i.e. pairwise constraints automatically generated from the predefined class labels). Furthermore, the results from the UH dataset suggest significant associations between clustering outcomes and CHD complication in Type II diabetes patients.
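
The two ideas named in the abstract, pairwise constraints generated automatically from class labels and a k-means assignment that respects them, can be illustrated with a minimal sketch. This is not the CK or SCK method, whose details the abstract does not give. It assumes a hard-constraint assignment step in the style of COP-KMeans, and every function name and parameter (generate_constraints, violates, constrained_assign, n_pairs, seed) is hypothetical.

```python
import random
from itertools import combinations


def generate_constraints(labels, n_pairs, seed=0):
    """Sample pairwise constraints from known class labels (illustrative only).

    Returns (must_link, cannot_link), each a list of index pairs (i, j), i < j.
    """
    rng = random.Random(seed)
    pairs = list(combinations(range(len(labels)), 2))
    sampled = rng.sample(pairs, min(n_pairs, len(pairs)))
    # Same label -> the two cases must share a cluster; otherwise they must not.
    must_link = [(i, j) for i, j in sampled if labels[i] == labels[j]]
    cannot_link = [(i, j) for i, j in sampled if labels[i] != labels[j]]
    return must_link, cannot_link


def violates(i, cluster, assignment, must_link, cannot_link):
    """True if placing case i in `cluster` breaks a constraint
    with a case that has already been assigned."""
    for a, b in must_link:
        other = b if a == i else a if b == i else None
        # An unassigned partner cannot be violated yet.
        if other is not None and assignment.get(other, cluster) != cluster:
            return True
    for a, b in cannot_link:
        other = b if a == i else a if b == i else None
        if other is not None and assignment.get(other) == cluster:
            return True
    return False


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def constrained_assign(points, centroids, must_link, cannot_link):
    """One assignment pass: each case goes to the nearest centroid
    that violates no constraint (assumed hard-constraint scheme)."""
    assignment = {}
    for i, p in enumerate(points):
        ranked = sorted(range(len(centroids)),
                        key=lambda c: sq_dist(p, centroids[c]))
        for c in ranked:
            if not violates(i, c, assignment, must_link, cannot_link):
                assignment[i] = c
                break
        else:
            raise ValueError(f"no feasible cluster for case {i}")
    return assignment
```

For instance, with points [(0.0, 0.0), (5.0, 5.0), (2.4, 2.4)] and centroids [(0.0, 0.0), (5.0, 5.0)], the third case is nearest to the first centroid, but a MustLink constraint (1, 2) forces it into the second cluster. A full constrained k-means would alternate this pass with the usual centroid update until the assignment stops changing.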
