Appliance of effective clustering technique for gene expression datasets using GPU

V Saveetha, P D R Vijayakumar, S Sophia

doi:10.1007/s10586-017-1621-x

Abstract

Advances in data mining techniques have made the analytical study of medical datasets possible. Microarrays enable the simultaneous monitoring of genes under several conditions. Identifying co-expressed genes and coherent patterns is the main goal in bioinformatics research, and cluster analysis of gene expression data has proven to be a valuable tool for finding biologically relevant groups of genes. The mutual information criterion of the algorithm measures the dependency among gene variables. Simulated annealing is applied to address the local minima problem of the K-means algorithm. The improved algorithm is further accelerated with parallelization techniques, since computational tasks in data mining can be performed effectively by graphics processing units (GPUs). An optimized K-means implementation is developed on the GPU, with NVIDIA's Compute Unified Device Architecture (CUDA) as the programming environment. Emphasis is placed on optimizations that directly target the data-parallel architecture to make the best use of the available computational capabilities. The algorithm is executed in a hybrid manner, parallelizing simulated annealing K-means based on the mutual information criterion (MIK). A performance study on a medical dataset demonstrates a maximum speedup of 7×. Experimental analysis shows that the proposed method performs well on gene expression data. The performance of the new clustering methods is compared with that of some existing methods, and the clustering algorithm based on a combined metric of mutual information and Euclidean distance achieves the best performance.
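
To illustrate how simulated annealing can help K-means escape local minima, the following is a minimal single-threaded Python sketch and not the authors' implementation. It uses squared Euclidean distance only: the abstract does not specify how the mutual information term is combined with the distance, so that term is omitted, and no GPU parallelization is attempted. The names `sa_kmeans`, `t0`, `cooling`, and `steps`, and the geometric cooling schedule, are assumptions made for illustration.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroids(points, labels, k, old):
    """Mean of each cluster; an empty cluster keeps its previous center."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, l in zip(points, labels):
        counts[l] += 1
        for d in range(dim):
            sums[l][d] += p[d]
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else list(old[c])
        for c in range(k)
    ]


def cost(points, centers, labels):
    """Within-cluster sum of squared distances (the K-means objective)."""
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))


def sa_kmeans(points, k, t0=1.0, cooling=0.95, steps=2000, seed=0):
    """Hypothetical sketch: K-means objective minimized by simulated annealing.

    Assumption: a move reassigns one random point to a random cluster, and
    the temperature decays geometrically. Returns (cost, labels, centers)
    for the best state visited.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
    centers = centroids(points, labels, k, centers)
    cur = cost(points, centers, labels)
    best = (cur, labels[:], centers)
    t = t0
    for _ in range(steps):
        i = rng.randrange(len(points))
        new_label = rng.randrange(k)
        if new_label != labels[i]:
            trial = labels[:]
            trial[i] = new_label
            trial_centers = centroids(points, trial, k, centers)
            delta = cost(points, trial_centers, trial) - cur
            # Metropolis rule: always accept improvements; accept a worse
            # state with probability exp(-delta / t).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                labels, centers, cur = trial, trial_centers, cur + delta
                if cur < best[0]:
                    best = (cur, labels[:], centers)
        t = max(t * cooling, 1e-12)  # cool down, avoiding division by zero
    return best


if __name__ == "__main__":
    data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    best_cost, best_labels, _ = sa_kmeans(data, k=2)
    print(best_cost, best_labels)
```

Accepting occasional uphill moves at high temperature is what lets the search leave a poor partition that plain K-means would settle into. In a data-parallel setting, the per-point distance evaluations are the natural candidates for GPU offloading, though the sketch above makes no claim about how the paper partitions that work.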

Full Text