A Python Clustering Analysis Protocol of Genes Expression Data Sets.

Giuseppe Agapito, Marianna Milano, Mario Cannataro

doi:10.3390/genes13101839

Open Access

Journal: Genes
Publication Date: Oct 12, 2022
Citations: 5
License type: CC BY 4.0

Affiliation: Magna Graecia University

Abstract

Gene expression and SNP data hold great potential for a new understanding of disease prognosis, drug sensitivity, and toxicity evaluation. Cluster analysis is used to analyze data that have no predefined subgroups; the goal is to use the data themselves to recognize meaningful and informative subgroups. In addition, cluster analysis supports data reduction, exposes hidden patterns, and generates hypotheses about the relationship between genes and phenotypes. It can also be used to identify biomarkers and to build computational predictive models. The methods used to analyze microarray data can profoundly influence the interpretation of the results, so a basic understanding of these computational tools is necessary for optimal experimental design and meaningful data analysis. This manuscript provides a protocol for analyzing gene expression data sets with the K-means and DBSCAN algorithms. The general protocol enables the analysis of omics data to identify subsets of features with low redundancy and high robustness, speeding up the identification of new biomarkers through pathway enrichment analysis. To demonstrate the effectiveness of our clustering analysis protocol, we analyze a real data set from the GEO database. Finally, the manuscript provides best practices and tips for overcoming common issues in the analysis of omics data sets through unsupervised learning.
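
The abstract names K-means and DBSCAN but shows no code at this point. As a minimal sketch only, and not the authors' protocol, the two algorithms could be applied to a samples-by-genes matrix roughly as follows. It assumes NumPy and scikit-learn are installed (the abstract does not name a library), uses synthetic data in place of a GEO data set, and the function names `make_toy_expression` and `cluster_expression` and the values of `k`, `eps`, and `min_samples` are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.preprocessing import StandardScaler


def make_toy_expression(n_per_group=30, n_genes=20, seed=0):
    """Build a synthetic samples-by-genes matrix with three sample groups.

    This is a stand-in for real expression data, not data from the paper.
    """
    rng = np.random.default_rng(seed)
    centers = [-3.0, 0.0, 3.0]  # hypothetical mean expression level per group
    blocks = [
        rng.normal(loc=c, scale=0.5, size=(n_per_group, n_genes))
        for c in centers
    ]
    return np.vstack(blocks)


def cluster_expression(X, k=3, eps=2.5, min_samples=5, seed=0):
    """Standardize each gene, then cluster the samples with both algorithms.

    k, eps, and min_samples are illustrative values tuned to the toy data;
    real data sets need their own parameter selection.
    """
    # Put every gene on a comparable scale before computing distances.
    X_scaled = StandardScaler().fit_transform(X)

    # K-means needs the number of clusters up front.
    kmeans_labels = KMeans(
        n_clusters=k, n_init=10, random_state=seed
    ).fit_predict(X_scaled)

    # DBSCAN infers the number of clusters from density; label -1 means noise.
    dbscan_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_scaled)

    return kmeans_labels, dbscan_labels


if __name__ == "__main__":
    X = make_toy_expression()
    km, db = cluster_expression(X)
    print("K-means cluster sizes:", np.bincount(km))
    print("DBSCAN clusters found:", len(set(db.tolist()) - {-1}))
    print("DBSCAN noise points:", int(np.sum(db == -1)))
```

The practical difference the sketch highlights is that K-means requires the number of clusters as input, whereas DBSCAN derives it from the density parameters and can mark samples as noise.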

Full Text