An entropy-based density peak clustering for numerical gene expression datasets

Rashmi Maheshwari, Amaresh Chandra Mishra, Sraban Kumar Mohanty

doi:10.1016/j.asoc.2023.110321

Abstract

In molecular biology, gene expression analysis is an important research area that deals with identifying genes with similar functionality, known as co-expressed genes. Data mining techniques such as clustering are frequently employed to group gene expressions with similar functional characteristics, and numerous clustering techniques are available for this purpose. Gene expression datasets usually result from millions of measurements, so they are high-dimensional and noisy, which makes conventional distance measures ineffective. Entropy-based distance computation, on the other hand, is much more effective at capturing the inhomogeneity in high-dimensional data and is also quite insensitive to noise. To exploit these advantages, we propose a novel method that uses the concept of entropy to compute the density distribution of data points in high-dimensional, noisy gene expression datasets. After obtaining the density distribution, an existing technique known as “Extreme Clustering” is used to obtain the clusters present in the gene expression dataset. The proposed technique is implemented and evaluated on diverse microarray gene expression datasets. Experimental results show that the proposed technique outperforms other popular density-based techniques in terms of cluster quality, robustness against noise, and biological significance of the genes within the clusters.

Full Text