Abstract
Optimizing the k hyperparameter in high-dimensional genomics remains a critical challenge that affects clustering quality. Better clustering can improve models for predicting patient outcomes and identifying personalized treatment plans. These improved models can in turn facilitate the discovery of biomarkers, which can be essential for early diagnosis, prognosis, and treatment response in cancer research. Our paper addresses this challenge through a four-fold approach. First, we empirically evaluate k-hyperparameter optimization algorithms in genomics analysis using a correlation-based feature selection method and a stratified k-fold cross-validation strategy. Second, we evaluate the best-performing optimization algorithm from the first step with a variety of dimensionality reduction methods used to reduce hyperparameter search spaces in genomics. Third, building on these two steps, we propose a novel algorithm for this optimization problem that jointly optimizes a Deep-Differential-Evolutionary Algorithm and Unsupervised Transfer Learning from Intelligent GenoUMAP (Uniform Manifold Approximation and Projection). Finally, we compare it with existing algorithms and validate its effectiveness. Our approach leverages a special autoencoder pre-trained with UMAP and integrates a deep-differential-evolutionary algorithm to tune k. These choices are based on the results of our empirical analysis. The novel algorithm balances population size between exploration and exploitation, which helps it find diverse solutions and the global optimum. The learning rate balances the number of iterations against convergence speed, leading to stable convergence towards the global optimum. UMAP’s superior performance, shown by shorter whiskers and higher median values in the comparative analysis, motivates its use for training the special autoencoder in the new algorithm.
The algorithm enhances clustering by balancing reconstruction accuracy, local structure preservation, and cluster compactness. Its comprehensive loss function optimizes clustering quality, promotes hyperparameter diversity, and facilitates effective knowledge transfer. This multi-objective joint optimization makes the algorithm effective for genomics data analysis. Validation of the algorithm on three genomic datasets demonstrates superior clustering scores. In addition, the convergence plots show smoother curves and an excellent fitness landscape. These findings hold significant promise for advancing cancer research and computational genomics at large.
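To make the idea of evolutionary tuning of k concrete, the following is a minimal sketch, not the algorithm proposed in the paper. It assumes plain differential evolution (DE/rand/1) over a single real-valued candidate that is rounded to an integer k, a basic k-means clusterer, and the silhouette score as the fitness. The autoencoder, the UMAP transfer-learning step, and the multi-objective loss are omitted, and every name here (`kmeans`, `silhouette`, `tune_k`, and all parameters) is hypothetical.

```python
import math
import random


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; -1.0 if fewer than two clusters exist."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        total += (b - a) / (max(a, b) or 1.0)
    return total / len(points)


def tune_k(points, k_min=2, k_max=8, pop_size=6, generations=10, f=0.8, seed=0):
    """Assumed setup: DE/rand/1 on one real value, rounded to an integer k."""
    rng = random.Random(seed)
    cache = {}  # fitness per integer k, so each k is clustered only once

    def fitness(x):
        k = int(round(x))
        if k not in cache:
            # best of three restarts to dampen k-means initialisation noise
            cache[k] = max(silhouette(points, kmeans(points, k, rng))
                           for _ in range(3))
        return cache[k]

    pop = [rng.uniform(k_min, k_max) for _ in range(pop_size)]
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = rng.sample([x for j, x in enumerate(pop) if j != i], 3)
            trial = min(k_max, max(k_min, a + f * (b - c)))  # mutate and clip
            if fitness(trial) >= fitness(pop[i]):  # greedy selection
                pop[i] = trial
    return int(round(max(pop, key=fitness)))


if __name__ == "__main__":
    rng = random.Random(42)
    data = [(rng.gauss(cx, 0.4), rng.gauss(cy, 0.4))
            for cx, cy in [(0, 0), (6, 6), (0, 6)] for _ in range(20)]
    print("selected k:", tune_k(data))
```

In this sketch the population size controls how many candidate values of k are explored per generation, and the scale factor `f` controls how far mutants move from existing candidates. Both are illustrative defaults rather than values taken from the paper.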