On the Selection of Appropriate Proximity Measurement for Gene Expression Data

Md Bipul Hossen

doi:10.11648/j.ijbmr.20170505.11

Abstract

Gene expression profiles have become a useful biological resource in recent years and play an important role in a broad range of biological research. However, the large number of genes and the complexity of biological networks greatly increase the difficulty of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. In the computational analysis of gene expression data, a central task is finding co-expressed genes, and the outcome depends on the proximity (similarity or dissimilarity) measure used in the clustering method. A number of proximity measures have been applied to gene expression data, but most of these studies have emphasized the biological results and have not critically assessed the suitability of the proximity measures for the analysis of gene expression data. This paper therefore investigates which proximity measure is appropriate for gene expression data. As a case study, we considered six real datasets. On these datasets, we provide a comparative study of five proximity measures: Euclidean distance, Manhattan distance, Pearson correlation, Spearman correlation, and Cosine distance. We use the Adjusted Rand Index and the Silhouette Index of the resulting clusterings to assess the quality and reliability of the results. Our results reveal that the Cosine distance with complete linkage exhibited the best performance for both Affymetrix and cDNA datasets according to the Adjusted Rand Index, whereas the Spearman correlation measure with complete linkage exhibited the best performance for both Affymetrix and cDNA datasets according to the Silhouette Index.
