In this paper we explore the concept of consensus clustering to identify, within a set of differentially expressed genes, a subset of genes that are either highly coexpressed or highly noncoexpressed, based on the hypothesis that this subset would serve as a better starting point for further analyses. A number of core clustering methods form the basis for constructing an agreement matrix (AM) that characterizes the level of coexpression between any two probesets. To overcome the limitations of using a single distance metric, we explore different metrics and examine the sensitivity of the AM to the input number of clusters in order to find a suggested number of clusters that best describes a particular dataset. The result of this level of analysis is a systematic framework for eliminating from further analysis those probesets that cannot be clearly characterized as either coexpressed or noncoexpressed with others. Subsequently, an agglomerative hierarchical clustering approach is applied to cluster the selected subset, using the agreement matrix as the similarity measure. Thus, the goal of the proposed methodology is twofold: (1) we seek to identify a more "clusterable" subset of the original set; and (2) we aim to further refine that subset in order to identify a core of genes that are either coexpressed or noncoexpressed within a certain confidence level. The approach is tested on a number of datasets, both synthetic and real, and is shown to identify subsets of genes and expression profiles that are more clusterable and are hypothesized to be more biologically relevant.
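The three stages described above (agreement matrix, probeset filtering, agglomerative clustering on agreement) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the `low`/`high` thresholds, the `min_clear` fraction, and the use of average linkage are all assumptions made for illustration. The label vectors stand in for the output of the core clustering methods, which the sketch does not run.

```python
from itertools import combinations

def agreement_matrix(labelings):
    """AM[i][j] = fraction of clustering runs that put items i and j together."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(labelings) for c in row] for row in counts]

def select_clear(am, low=0.2, high=0.8, min_clear=0.75):
    """Keep items whose agreement with most others is clearly low or clearly high.

    The thresholds are hypothetical; an item is kept when at least `min_clear`
    of its pairwise agreements fall outside the ambiguous band (low, high).
    """
    n = len(am)
    keep = []
    for i in range(n):
        others = [am[i][j] for j in range(n) if j != i]
        clear = sum(1 for a in others if a <= low or a >= high)
        if clear / len(others) >= min_clear:
            keep.append(i)
    return keep

def agglomerate(am, items, k):
    """Average-linkage agglomerative clustering using agreement as similarity."""
    clusters = [[i] for i in items]

    def similarity(a, b):
        return sum(am[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        # Merge the pair of clusters with the highest mean agreement.
        x, y = max(combinations(range(len(clusters)), 2),
                   key=lambda p: similarity(clusters[p[0]], clusters[p[1]]))
        merged = clusters.pop(y)  # y > x, so index x is unaffected
        clusters[x].extend(merged)
    return clusters

# Toy input: four hypothetical clustering runs over six probesets.
# Probeset 5 switches groups between runs, so its agreement is ambiguous.
labelings = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
]
am = agreement_matrix(labelings)
keep = select_clear(am)
clusters = agglomerate(am, keep, k=2)
print(keep)      # [0, 1, 2, 3, 4]
print(clusters)  # [[0, 1, 2], [3, 4]]
```

In this toy input, probeset 5 agrees with every other probeset in exactly half of the runs, so it is dropped before the final clustering. In practice the label vectors would come from several clustering methods, distance metrics, and cluster counts, as the abstract describes.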