Abstract

Background

In genomics, hierarchical clustering (HC) is a popular method for grouping similar samples based on a distance measure. HC algorithms do not actually create clusters, but compute a hierarchical representation of the data set. Usually, a fixed height on the HC tree is used, and each contiguous branch of samples below that height is considered a separate cluster. Because of the fixed-height cutting, those clusters may fail to reveal significant functional coherence hidden deeper in the tree. In addition, most existing approaches do not make use of available clinical information to guide cluster extraction from the HC tree, so the identified subgroups may be difficult to interpret in relation to that information.

Results

We develop a novel framework, called guided piecewise snipping, for decomposing the HC tree into clusters by semi-supervised piecewise snipping. The framework uses both molecular data and clinical information. It cuts the given HC tree at variable heights to find a partition (a set of non-overlapping clusters) that not only represents a structure deemed to underlie the data from which the HC tree is derived, but is also maximally consistent with the supplied clinical data. Moreover, the approach does not require the user to specify the number of clusters prior to the analysis. Extensive results on simulated and multiple medical data sets show that our approach consistently produces more meaningful clusters than the standard fixed-height cut and/or non-guided approaches.

Conclusions

The guided piecewise snipping approach features several novelties and advantages over existing approaches. The proposed algorithm is generic, and can be combined with other algorithms that operate on detected clusters.
This approach represents an advancement in several regards: (1) a piecewise tree snipping framework that efficiently extracts clusters by snipping the HC tree, possibly at variable heights, while preserving the HC tree structure; (2) a flexible implementation allowing a variety of data types for both building and snipping the HC tree, including patient follow-up data like survival as auxiliary information.

The data sets and R code are provided as supplementary files. The proposed method is available from Bioconductor as the R-package HCsnip.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-014-0448-1) contains supplementary material, which is available to authorized users.

Highlights

  • In genomics, hierarchical clustering (HC) is a popular method for grouping similar samples based on a distance measure

  • Since there are no explicit clusters in the HC output, clusters are obtained either manually by visually inspecting the tree structure or by cutting the HC tree at a specific height, after which the resulting connected components are treated as clusters

  • Comparison: For the data sets above, the performance of guided piecewise snipping was compared with (a) unguided piecewise snipping, which makes no use of survival data; and (b) guided and unguided fixed-height cuts


Summary

Introduction

Hierarchical clustering (HC) is a popular method for grouping similar samples based on a distance measure. Since there are no explicit clusters in the HC output, clusters are obtained either manually, by visually inspecting the tree structure, or by cutting the HC tree at a specific height, after which the resulting connected components are treated as clusters. The latter (referred to as the fixed-height cut hereafter) is a simple yet elegant technique commonly used in practice. Because of the fixed-height cutting, however, the resulting clusters may fail to reveal significant functional coherence hidden deeper in the tree. Moreover, most existing approaches do not make use of available clinical information to guide cluster extraction from the HC tree. Since HC itself does not utilize the available clinical data, there is no guarantee that the identified subtypes will exhibit significant functional coherence.
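
To make the difference between a fixed-height cut and a variable-height cut concrete, the following is a minimal sketch in Python using only the standard library. It is not the HCsnip implementation: the Node class, the helper functions, and the toy tree with its merge heights are all hypothetical and chosen for illustration. In particular, the set of nodes to split is hard-coded here, whereas a guided approach would select it using clinical data.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(eq=False)  # eq=False keeps identity-based hashing, so nodes can go in a set
class Node:
    height: float                   # merge height; 0.0 for a leaf
    label: Optional[str] = None     # sample name, set only on leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaf(label: str) -> Node:
    return Node(height=0.0, label=label)


def merge(left: Node, right: Node, height: float) -> Node:
    return Node(height=height, left=left, right=right)


def leaves(node: Node) -> List[str]:
    """Sample labels under a node, in left-to-right order."""
    if node.label is not None:
        return [node.label]
    return leaves(node.left) + leaves(node.right)


def snip(node: Node, should_split: Callable[[Node], bool]) -> List[List[str]]:
    """Walk down from the root; a subtree becomes one cluster as soon as
    should_split declines to split it (or a leaf is reached)."""
    if node.label is not None or not should_split(node):
        return [leaves(node)]
    return snip(node.left, should_split) + snip(node.right, should_split)


def fixed_height_cut(root: Node, h: float) -> List[List[str]]:
    """Fixed-height cut: split every merge that lies above height h."""
    return snip(root, lambda n: n.height > h)


# Toy dendrogram over six samples (hypothetical merge heights).
ab = merge(leaf("a"), leaf("b"), 1.0)
cd = merge(leaf("c"), leaf("d"), 1.5)
abcd = merge(ab, cd, 4.0)
ef = merge(leaf("e"), leaf("f"), 3.0)
root = merge(abcd, ef, 6.0)

fixed_high = fixed_height_cut(root, 3.5)
fixed_low = fixed_height_cut(root, 1.2)

# Variable-height cut: split the (c, d) merge at height 1.5 while keeping the
# (e, f) merge at height 3.0 intact. The split set is an illustrative choice.
split_nodes = {root, abcd, cd}
piecewise = snip(root, lambda n: n in split_nodes)

print(fixed_high)  # [['a', 'b'], ['c', 'd'], ['e', 'f']]
print(fixed_low)   # [['a', 'b'], ['c'], ['d'], ['e'], ['f']]
print(piecewise)   # [['a', 'b'], ['c'], ['d'], ['e', 'f']]
```

In this toy tree, no single height reproduces the last partition: any height low enough to separate c from d also separates e from f. Expressing both cuts through the same snip function, with the fixed-height cut as the special case of a height threshold, is a design choice of this sketch only.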

