Abstract

We introduce a graph-theoretic approach to extract clusters and hierarchies in complex data-sets in an unsupervised and deterministic manner, without the use of any prior information. This is achieved by building topologically embedded networks containing the subset of most significant links and analyzing the network structure. For a planar embedding, this method provides both the intra-cluster hierarchy, which describes the way clusters are composed, and the inter-cluster hierarchy, which describes how clusters gather together. We discuss the performance, robustness and reliability of this method by first investigating several artificial data-sets, finding that it can significantly outperform other established approaches. We then show that our method can successfully differentiate meaningful clusters and hierarchies in a variety of real data-sets. In particular, we find that the application to gene expression patterns of lymphoma samples uncovers biologically significant groups of genes which play key roles in the diagnosis, prognosis and treatment of some of the most relevant human lymphoid malignancies.

Highlights

  • Filtering information out of complex datasets is becoming a central issue and a crucial bottleneck in any scientific endeavor

  • We apply the DBHT technique to various data sets ranging from artificial data with known clustering and hierarchical structures to real gene expression data

  • Comparisons are made between the results retrieved by the DBHT technique and those of some state-of-the-art cluster analysis techniques such as k-means++ [29], spectral clustering via normalized cut on a k-nearest-neighbor graph [30,31], Self Organizing Map (SOM) [32] and Q-cut [33]

Introduction

Filtering information out of complex datasets is becoming a central issue and a crucial bottleneck in any scientific endeavor. The requirement of any prior information is a potential problem because filtering is often one of the preliminary processing steps applied to the data, performed at a stage where very little information about the system is available. Another difficulty may arise from the fact that, in some cases, reducing the system to a set of separated local communities may hide properties associated with its global organization. Several methods exist in the literature for extracting clusters and hierarchies [1,2,3], and their application to biology and gene expression data has attracted great attention in recent years [4,5,6,7]. In these established approaches, extracting discrete clusters requires supplying some a priori information about their number or defining a threshold value. We propose an alternative method that overcomes these limitations, providing both clustering subdivision and hierarchical organization without the need for any prior information, without demanding supervision and without requiring thresholding.
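As a rough illustration of keeping only the most significant links of a similarity matrix without choosing a threshold, the sketch below builds a maximum spanning tree with Kruskal's algorithm. This is an assumption made for illustration: a tree is a simpler topological constraint than the planar embedding described in the abstract, and the code is not the authors' DBHT procedure. The function name `maximum_spanning_tree` and the toy matrix are hypothetical.

```python
from itertools import combinations


def maximum_spanning_tree(similarity):
    """Keep the n-1 strongest links that connect all nodes without a cycle.

    `similarity` is a symmetric square matrix given as a list of lists.
    Hypothetical helper for illustration only; not the DBHT algorithm.
    """
    n = len(similarity)
    parent = list(range(n))  # union-find forest: every node starts as its own root

    def find(i):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Rank all candidate links from most to least similar.
    ranked = sorted(
        combinations(range(n), 2),
        key=lambda edge: similarity[edge[0]][edge[1]],
        reverse=True,
    )

    kept = []
    for i, j in ranked:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:        # the link joins two separate components
            parent[root_i] = root_j
            kept.append((i, j))
            if len(kept) == n - 1:  # a spanning tree has exactly n-1 links
                break
    return kept


# Toy similarity matrix with made-up values: nodes 0, 1, 2 are mutually
# similar, as are nodes 3 and 4.
toy = [
    [1.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 1.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 1.0, 0.3, 0.1],
    [0.1, 0.2, 0.3, 1.0, 0.9],
    [0.2, 0.1, 0.1, 0.9, 1.0],
]
tree = maximum_spanning_tree(toy)
print(tree)
```

In this toy case the two tight groups end up joined by a single weak link (between nodes 2 and 3). No cluster count or cut-off value is supplied: the number of retained links is fixed by the topological constraint alone.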
