An Iterative Data Mining Approach for Mining Overlapping Coexpression Patterns in Noisy Gene Expression Data

P.C.H. Ma, K.C.C. Chan

doi:10.1109/tnb.2009.2026747

Abstract

Clustering is concerned with the discovery of groupings of records in a database. Many clustering problems are defined as partitioning problems in the sense that similar records are grouped into nonoverlapping partitions. However, clustering gene expression data to discover coexpressed genes may not always be meaningful if the problem is reduced to a partitioning problem. Due to the complexity of the underlying biological processes, a protein can interact with one or more other proteins belonging to different functional classes in order to perform a particular biological role. For this reason, when responding to different external stimuli, a gene that produces a particular protein can coexpress with more than one group of other genes. The gene can therefore belong to more than one group of coexpressed genes. This poses a challenge to many clustering algorithms, as they were not originally developed to discover overlapping clusters in noisy gene expression data. In this paper, we propose an iterative data mining approach that consists of two phases. In phase 1, a clustering algorithm is used to discover an initial, nonoverlapping partitioning of the gene expression profiles. Then, in phase 2, the partition memberships of genes are redetermined iteratively by a pattern discovery technique to determine whether a gene should remain in the same partition, be moved to another partition, or also be grouped with genes in other partitions. The proposed approach has been tested with both artificial and real datasets. Experimental results show that it can improve the performance of existing clustering algorithms and can effectively discover overlapping clusters in noisy gene expression data.
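
The two-phase idea can be illustrated with a minimal sketch. The abstract does not specify the phase 1 clustering algorithm or the phase 2 pattern discovery technique, so the code below swaps in plain k-means for phase 1 and a simple Pearson-correlation threshold against cluster centroids for phase 2. Both choices, along with the function names `kmeans_partition`, `row_correlations`, and `refine_memberships` and the `threshold` parameter, are assumptions made for illustration and are not the authors' method.

```python
import numpy as np

def kmeans_partition(X, k, n_iter=50, seed=0):
    """Phase 1 stand-in (assumption): plain k-means, one label per gene."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every gene to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

def row_correlations(X, centroids):
    """Pearson correlation between each gene profile and each centroid."""
    Xc = X - X.mean(axis=1, keepdims=True)
    Cc = centroids - centroids.mean(axis=1, keepdims=True)
    Xn = np.linalg.norm(Xc, axis=1, keepdims=True)
    Cn = np.linalg.norm(Cc, axis=1, keepdims=True)
    # Guard against division by zero for flat profiles or empty clusters.
    return (Xc @ Cc.T) / np.maximum(Xn * Cn.T, 1e-12)

def refine_memberships(X, labels, k, threshold=0.8, max_iter=20):
    """Phase 2 stand-in (assumption): iterative, possibly overlapping reassignment.

    A gene joins every cluster whose centroid it correlates with at or above
    `threshold`, and always keeps its best-correlated cluster, so it may stay,
    move, or belong to several clusters at once.
    """
    X = np.asarray(X, dtype=float)
    rows = np.arange(len(X))
    member = np.zeros((len(X), k), dtype=bool)
    member[rows, labels] = True
    for _ in range(max_iter):
        centroids = np.vstack([
            X[member[:, j]].mean(axis=0) if member[:, j].any()
            else np.zeros(X.shape[1])
            for j in range(k)
        ])
        corr = row_correlations(X, centroids)
        new_member = corr >= threshold
        new_member[rows, corr.argmax(axis=1)] = True
        if np.array_equal(new_member, member):
            break  # memberships are stable
        member = new_member
    return member

# Toy profiles over six conditions: two uncorrelated patterns and one gene
# that follows both of them.
A = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
B = np.array([1.0, 1.0, -1.0, -1.0, 0.0, 0.0])
X = np.vstack([A, 2 * A, B, 2 * B, A + B])

initial = kmeans_partition(X, k=2)
membership = refine_memberships(X, initial, k=2, threshold=0.6)
print(membership)  # rows are genes, columns are clusters
```

The boolean membership matrix is what allows overlap: a row with more than one `True` entry corresponds to a gene assigned to several coexpression groups. The threshold value controls how readily genes are shared between groups and would need tuning on real, noisy data.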

Full Text