Performing network-based analysis on medical and biological data makes a wide variety of machine learning tools available. Clustering, which can be used for classification, presents opportunities for identifying hard-to-reach groups for the development of customized health interventions. Driven by the desire to convert abundant DNA gene co-expression data into networks, many graph inference methods have been developed. Likewise, many clustering and classification tools exist. This paper compares techniques for graph inference and clustering, using different numbers of features, in order to select the best tuple of graph inference method, clustering method, and number of features for a particular phenotype. An extensive machine-learning-based analysis of the REGARDS dataset is conducted, evaluating the CoNet and K-Nearest Neighbors (KNN) network inference methods, along with the Louvain, Leiden, and NBR-Clust clustering techniques. Results from an analysis involving five internal cluster evaluation indices show that the traditional KNN inference method and the NBR-Clust and Louvain clustering techniques produce the most promising clusters with medical phenotype data. It is also shown that visualization can aid in interpreting the clusters, and that the clusters produced can identify meaningful groups for which customized interventions are indicated.
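As a rough illustration of the KNN network inference step named above, the following minimal sketch links each record to its k nearest records and treats the result as an undirected graph. The Euclidean distance, the value k=2, the function name `knn_graph`, and the toy data are all assumptions made for illustration; the abstract does not state the distance measure, the value of k, or the features actually used with the REGARDS dataset.

```python
from math import dist


def knn_graph(rows, k):
    """Return an undirected edge set linking each row to its k nearest rows.

    Assumption: Euclidean distance on numeric feature vectors.
    """
    edges = set()
    for i, row in enumerate(rows):
        # Rank every other row by distance to row i (ties broken by index).
        neighbors = sorted(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: (dist(row, rows[j]), j),
        )
        for j in neighbors[:k]:
            # Store each edge once, smaller index first, so the graph is undirected.
            edges.add((min(i, j), max(i, j)))
    return edges


# Hypothetical toy data: two well-separated groups of three records each.
participants = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
]
edges = knn_graph(participants, k=2)
```

In this toy case every edge stays inside one of the two groups, which is the kind of structure a community detection method such as Louvain would then be expected to recover as clusters.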