Abstract

Biological data, and imaging data in particular, have grown exponentially in volume and complexity in the last few years, raising new challenges in the field of machine learning. Unsupervised problems are of particular relevance, as generating labels for the data is often labor-intensive, expensive or simply not possible. However, interpretability of the data and the results is key to extracting new, valuable knowledge from the large-scale datasets that are studied. This highlights the need for adequate unsupervised dimensionality reduction techniques that lower the computational workload required to process a dataset while also providing information on its structure. This paper describes a framework that brings together previous proposals on unsupervised feature clustering, with the goal of providing scalable, interpretable and robust dimensionality reduction on single-cell imaging data. The framework integrates several inter-feature dissimilarity measures, clustering algorithms, quality criteria to select the best feature clustering, and dimensionality reduction methods that are built on the clustering. For each of these components, several approaches proposed in previous works were tested and evaluated on three use cases drawn from two different imaging datasets, highlighting the best-performing components. Affinity clustering is applied to feature clustering for the first time. The results were validated using statistical tests, showing that many of the combinations tested lowered the complexity of the datasets while maintaining or improving the accuracy of classifiers applied to them. The analysis highlighted affinity clustering as the best algorithm for feature clustering, with median differences in accuracy of up to 8.9% and 0.9% with respect to FSFS and hierarchical clustering, respectively. Representation entropy proved to be a robust unsupervised criterion for selecting the cluster set, with median differences of 13.0% and 0.8% with respect to class separability and silhouette index, respectively. Dissimilarities based on Pearson’s correlation performed slightly better than the alternatives, with a median improvement of 2.8% with respect to the cosine distance.
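
As a rough illustration of two of the components named above, the following minimal sketch computes a Pearson-based inter-feature dissimilarity and a representation entropy score with NumPy. It is not the paper's implementation: the dissimilarity form `1 - |r|`, the function names, and the synthetic data are assumptions made here for illustration, and representation entropy is taken in its usual form as the entropy of the normalized eigenvalues of the feature covariance matrix.

```python
import numpy as np

def pearson_dissimilarity(X):
    """Feature-by-feature dissimilarity, assumed here to be 1 - |r|."""
    # X has shape (n_samples, n_features); rowvar=False treats columns as variables.
    r = np.corrcoef(X, rowvar=False)
    return 1.0 - np.abs(r)

def representation_entropy(X):
    """Entropy of the normalized eigenvalues of the feature covariance matrix."""
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # Clip tiny negative eigenvalues caused by floating-point error.
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    p = eigvals / eigvals.sum()
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum())

# Synthetic data (illustrative only): two independent signals,
# each observed through two nearly redundant features.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack([
    base[:, 0], base[:, 0] + 0.05 * rng.normal(size=200),
    base[:, 1], base[:, 1] + 0.05 * rng.normal(size=200),
])

D = pearson_dissimilarity(X)                      # 4 x 4 dissimilarity matrix
H_all = representation_entropy(X)                 # all four features
H_reduced = representation_entropy(X[:, [0, 2]])  # one feature per redundant pair
```

In this toy setting, redundant features have a dissimilarity close to zero, so a feature clustering algorithm run on `D` would be expected to group them together; keeping one representative per group yields the reduced feature set scored by `H_reduced`.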
