Abstract

K-Nearest Neighbors (KNN) is a very popular supervised machine learning algorithm that can be used effectively for both multi-class classification and regression across multiple features of a dataset. It trains extremely fast, with O(n) time complexity. However, generating predictions with KNN takes O(n³dk) time (where k is the number of neighbours, d the dimensionality of the data, and n the number of training points), which makes it very difficult to work with larger datasets. Although parallel and distributed computing techniques exist for speeding up KNN, they still face limitations in horizontal scalability, fault tolerance, and incremental learning. To address these issues, this paper presents a novel KNN-inspired supervised machine learning algorithm called Distributed Nearest Hash (DNH), which makes unique use of a hashmap and primary-key clustering order in wide-column store databases. This design enables easy, incremental, near-real-time scalability: DNH can train on an unbounded number of data points in O(n) time and predict in near real time with O(1) time complexity, assuming n >> d. It can be readily integrated into any system that uses KNN as a base classifier or regressor. Experimental results show that the proposed DNH model is 25% faster than a state-of-the-art distributed KNN algorithm. Combined with fuzzy logic, the algorithm can also be applied to classification over several classes. It could therefore aid in building powerful smart decision-making systems with very little computational power.
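To make the core idea concrete, here is a minimal sketch of hash-based nearest-neighbour lookup in the spirit described above: O(n) training (one pass over the data) and O(1) prediction via a hashmap lookup. The abstract does not specify DNH's actual hashing scheme or its wide-column storage layout, so the quantization-based key, the class name `HashNN`, and the `bin_width` parameter below are illustrative assumptions, not the paper's method.

```python
from collections import defaultdict, Counter

class HashNN:
    """Toy hash-based nearest-neighbour store: O(n) training, O(1) lookup.

    Illustrative sketch only; the real DNH algorithm uses primary-key
    clustering order in a wide-column store, which is not modelled here.
    """

    def __init__(self, bin_width=1.0):
        self.bin_width = bin_width          # coarseness of the hash key (assumed)
        self.table = defaultdict(Counter)   # hash key -> label counts

    def _key(self, point):
        # Quantize each feature so that nearby points share a bucket.
        return tuple(int(x // self.bin_width) for x in point)

    def fit(self, X, y):
        # Single pass over the training data: O(n) training time.
        for point, label in zip(X, y):
            self.table[self._key(point)][label] += 1
        return self

    def predict(self, point):
        # Constant-time bucket lookup; None if no training point hashed here.
        counts = self.table.get(self._key(point))
        return counts.most_common(1)[0][0] if counts else None
```

Because `fit` only appends counts to the hashmap, new data points can be absorbed incrementally without retraining, which mirrors the incremental-learning property claimed for DNH:

```python
model = HashNN(bin_width=2.0).fit([[0.5, 1.0], [0.7, 1.2], [5.0, 5.0]],
                                  ["a", "a", "b"])
model.predict([0.6, 1.1])  # → "a"
```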
