Abstract

DBSCAN is a density-based clustering algorithm that is widely used in image processing, data mining, machine learning, and other fields. As the size of clusters increases, parallel DBSCAN algorithms have come into wide use. However, we find that the current partitioning method for DBSCAN is too simple and that the GETNEIGHBORS query steps repeatedly access the dataset on Spark. We therefore propose DBSCAN-PSM, which applies a new data partitioning and merging method. In the first stage of our method, we introduce a KD-tree and combine partitioning with the GETNEIGHBORS query, which reduces the number of accesses to the dataset and lessens the impact of I/O on the algorithm. In the second stage, we use the features of points during merging to avoid the time cost of global labeling. Experimental results show that our method improves parallel efficiency and clustering performance.
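
To illustrate why a KD-tree helps the GETNEIGHBORS step, the following is a minimal single-machine sketch in Python. It is not the DBSCAN-PSM implementation: it omits Spark, partitioning, and merging entirely, and the names `build_kdtree`, `get_neighbors`, and `dbscan` are hypothetical, chosen only for this example. It shows the general idea that a radius query can prune whole subtrees instead of scanning every point in the dataset.

```python
from math import dist


def build_kdtree(points, indices=None, depth=0):
    """Build a KD-tree as nested tuples: (point_index, axis, left, right)."""
    if indices is None:
        indices = list(range(len(points)))
    if not indices:
        return None
    axis = depth % len(points[0])  # cycle through the dimensions
    indices = sorted(indices, key=lambda i: points[i][axis])
    mid = len(indices) // 2        # median point becomes the splitting node
    return (
        indices[mid],
        axis,
        build_kdtree(points, indices[:mid], depth + 1),
        build_kdtree(points, indices[mid + 1:], depth + 1),
    )


def get_neighbors(node, points, query, eps, found=None):
    """Return indices of all points within distance eps of query."""
    if found is None:
        found = []
    if node is None:
        return found
    idx, axis, left, right = node
    if dist(points[idx], query) <= eps:
        found.append(idx)
    diff = query[axis] - points[idx][axis]
    near, far = (left, right) if diff < 0 else (right, left)
    get_neighbors(near, points, query, eps, found)
    # The far side can only hold neighbors if the splitting plane is within eps.
    if abs(diff) <= eps:
        get_neighbors(far, points, query, eps, found)
    return found


def dbscan(points, eps, min_pts):
    """Plain sequential DBSCAN; labels are cluster ids, with -1 for noise."""
    tree = build_kdtree(points)
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = get_neighbors(tree, points, points[i], eps)
        if len(neighbors) < min_pts:
            labels[i] = -1         # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = get_neighbors(tree, points, points[j], eps)
            if len(j_neighbors) >= min_pts:  # only core points expand the cluster
                queue.extend(j_neighbors)
    return labels


if __name__ == "__main__":
    sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
    print(dbscan(sample, eps=1.5, min_pts=3))
```

In this sketch the neighbor count includes the query point itself, which is one common convention for `min_pts`. The tree is built once and reused for every query, so each lookup visits only the subtrees that can intersect the `eps` ball.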
