An Accurate and Efficient Large-scale Regression Method through Best Friend Clustering

Kun Li, Liang Yuan, Gongwei Chen, Yunquan Zhang

doi:10.1109/tpds.2021.3134336

Open Access

https://doi.org/10.1109/tpds.2021.3134336

Journal: IEEE Transactions on Parallel and Distributed Systems	Publication Date: Jan 1, 2021
Citations: 1	License type: publisher-specific-oa

Abstract

As data sizes in machine learning grow exponentially, it becomes necessary to accelerate computation by exploiting the ever-growing number of cores provided by high-performance computing hardware. However, existing parallel methods for clustering or regression often suffer from low accuracy, slow convergence, and complex hyperparameter tuning. Furthermore, parallel efficiency is usually difficult to improve while balancing the preservation of model properties against the partitioning of computing workloads on distributed systems. In this paper, we propose a novel and simple data structure that captures the most important information among data samples. It has several advantageous properties: it supports a hierarchical clustering strategy that is independent of the hardware parallelism, well-defined metrics for determining the optimal clustering, balanced partitioning that maintains the compactness property, and efficient parallelization that accelerates the computation phases. We then combine the clustering with regression techniques in a parallel library and use a hybrid of data and model parallelism to make predictions. Experiments show that our library achieves strong performance in convergence, accuracy, and scalability.
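
The abstract does not specify the proposed data structure or the clustering rule, so the following is only a minimal serial sketch of the general idea of combining clustering with regression: partition the samples, fit one linear model per cluster, and route each new sample to the model of its nearest cluster. It assumes plain k-means as a stand-in for the paper's clustering and ordinary least squares as the regression technique; the function names `kmeans`, `assign`, `fit_cluster_regression`, and `predict` are hypothetical and are not the library's API.

```python
import numpy as np

def assign(X, centers):
    """Return, for each row of X, the index of its nearest center."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, iters=20):
    """Plain Lloyd's k-means (assumed stand-in, not the paper's method)."""
    # Deterministic initialization: k samples evenly spaced through X.
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        labels = assign(X, centers)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's center in place
                centers[j] = members.mean(axis=0)
    return centers

def fit_cluster_regression(X, y, k):
    """Cluster X, then fit one least-squares linear model per cluster."""
    centers = kmeans(X, k)
    labels = assign(X, centers)
    # Keep only clusters that actually received samples.
    kept = [j for j in range(k) if np.any(labels == j)]
    coefs = []
    for j in kept:
        mask = labels == j
        # Append a column of ones so each model has an intercept term.
        A = np.hstack([X[mask], np.ones((mask.sum(), 1))])
        coef, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        coefs.append(coef)
    return centers[kept], np.array(coefs)

def predict(X, centers, coefs):
    """Route each sample to its nearest cluster and apply that model."""
    labels = assign(X, centers)
    A = np.hstack([X, np.ones((len(X), 1))])
    # Row-wise dot product between each sample and its cluster's weights.
    return np.einsum("ij,ij->i", A, coefs[labels])
```

Because each cluster's model is fitted independently of the others, the per-cluster loop is the natural place where such a scheme could be distributed across workers; how the paper's library actually partitions the work is not described in the abstract.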

Full Text