Abstract

BackgroundClustering techniques are routinely used in gene expression data analysis to organize the massive data. Clustering techniques arrange a large number of genes or assays into a few clusters while maximizing the intra-cluster similarity and inter-cluster separation. While clustering of genes facilitates learning the functions of un-characterized genes using their association with known genes, clustering of assays reveals the disease stages and subtypes. Many clustering algorithms require the user to specify the number of clusters a priori. A wrong specification of number of clusters generally leads to either failure to detect novel clusters (disease subtypes) or unnecessary splitting of natural clusters.ResultsWe have developed a novel method to find the number of clusters in gene expression data. Our procedure evaluates different partitions (each with different number of clusters) from the clustering algorithm and finds the partition that best describes the data. In contrast to the existing methods that evaluate the partitions independently, our procedure considers the dynamic rearrangement of cluster members when a new cluster is added. Partition quality is measured based on a new index called Net InFormation Transfer Index (NIFTI) that measures the information change when an additional cluster is introduced. Information content of a partition increases when clusters do not intersect and decreases if they are not clearly separated. A partition with the highest Total Information Content (TIC) is selected as the optimal one. We illustrate our method using four publicly available microarray datasets.ConclusionIn all four case studies, the proposed method correctly identified the number of clusters and performs better than other well known methods. Our method also showed invariance to the clustering techniques.

Highlights

  • Clustering techniques are routinely used in gene expression data analysis to organize the massive data

  • Clustering is a statistical technique that partitions a large number of objects into a few clusters such that objects within the same cluster are more similar to each other than to the objects in other clusters

  • Since gene clusters are often enriched with genes involving in common biological processes, identifying such clusters discloses potential roles of previously un-characterized genes and provides insights into gene regulation

Read more

Summary

Introduction

Clustering techniques are routinely used in gene expression data analysis to organize the massive data. Clustering techniques arrange a large number of genes or assays into a few clusters while maximizing the intra-cluster similarity and inter-cluster separation. Clustering is widely used in gene expression data analysis to cluster genes and/or samples (assays) based on their similarity in expression patterns. Despite the widespread use of clustering algorithms in gene expression data analysis [1,2,3,4,5,6], selection of clustering parameters continues to be a challenge. The optimal specification of number of clusters, k, is difficult especially if there is inadequate biological understanding of the system [7]. While the correct number of clusters can be identified by visual inspection in some cases, in most gene expression datasets, the data dimensions are too high for effective visualization. Methods that find the optimal number of clusters are essential

Objectives
Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call