Abstract

Tabular Clinical and Biomedical Routine Data (CBRD) contains diverse feature types. Recent research shows that the conventional application of Uniform Manifold Projection and Approximation (UMAP) to extract clusters from the low dimensional embedding can prove ineffective due to the diverse feature types in such datasets. Feature-type Distributed Clustering (FDC) workflow accounts for these diverse feature types resulting in a more informative low-dimensional embedding. However, a rigorous assessment of the FDC algorithm is missing so far. In this work, we conducted comprehensive benchmarking experiments to compare the quality of the cluster distributions and low dimensional embeddings generated by the FDC against that of the ones generated by UMAP using standard objective measures: Silhouette score, Dunn index, and ANOVA. Our results confirm that FDC can indeed be the better choice to embed tabular data with diverse feature types in low dimensions and thereby extract clusters from such an embedding. In addition, we provide a rationale behind the choice of metrics proposed in the FDC workflow. Moreover, we also point out some problems with the original Canberra metric used to reduce ordinal features in the FDC workflow and provide a solution in the form of a modified version of the Canberra metric. Using seven datasets from the medical domain for benchmarking, we demonstrate that FDC leads to improved patient stratification.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.