Abstract
Similarity join is widely used in data analysis and data mining applications. This paper focuses on the scalability and performance of similarity join queries over massive high-dimensional data sets. A projection scheme based on p-stable distributions can reduce dimensionality effectively. Three novel approaches based on this projection scheme are proposed for the similarity join problem on massive high-dimensional data: the Single Projection method, the Multiple Projection method, and the Projection Space Partitioning method. Comprehensive experiments were performed to evaluate these approaches. The experimental results show that the proposed approaches perform well and scale well.
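To illustrate why a p-stable projection can stand in for a distance computation, the following is a minimal sketch for the 2-stable (Gaussian) case only. It relies on the standard property that, for a vector a with independent N(0, 1) entries, a·u − a·v is distributed as ‖u − v‖₂ times a standard normal variable. The helper names (`gaussian_projection`, `project`), the dimension, and the number of projections are assumptions chosen for illustration; this is not the paper's algorithm.

```python
import math
import random
import statistics

def gaussian_projection(dim, rng):
    """Draw one projection vector with i.i.d. N(0, 1) entries (2-stable)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def project(a, v):
    """Project vector v onto a: the dot product a . v (a single number)."""
    return sum(x * y for x, y in zip(a, v))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
dim = 50                # hypothetical dimensionality
u = [rng.random() for _ in range(dim)]
v = [rng.random() for _ in range(dim)]
true_dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# |a.u - a.v| is distributed as true_dist * |N(0, 1)|, so its median over
# many independent projections is about true_dist * median(|N(0, 1)|).
gaps = []
for _ in range(5000):
    a = gaussian_projection(dim, rng)
    gaps.append(abs(project(a, u) - project(a, v)))

MEDIAN_ABS_STD_NORMAL = 0.6745  # median of |N(0, 1)|, a known constant
estimated_dist = statistics.median(gaps) / MEDIAN_ABS_STD_NORMAL

print(f"true distance:      {true_dist:.3f}")
print(f"estimated distance: {estimated_dist:.3f}")
```

Each projection maps a 50-dimensional vector to a single number, yet the gaps between projected values still carry information about the original Euclidean distance, which is what makes projection usable as a dimension-reduction step before a join.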
Highlights
With the development of data acquisition technology and equipment, data size, precision, and dimensionality are increasing at an unprecedented rate. Many types of data, such as images, video, trajectories, and time series, can have thousands or even tens of thousands of dimensions.
High-dimensional data similarity join finds all pairs of data items in a massive high-dimensional data set whose distance does not exceed a predefined threshold. It plays an important role in many fields, such as image clustering, document de-duplication, and similar-video detection.
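The definition above can be made concrete with a brute-force baseline. The sketch below assumes Euclidean distance and a self-join on a single data set; the function names and sample points are hypothetical. It compares every pair, so its cost grows quadratically with the number of items, which is the cost that projection-based methods aim to reduce.

```python
import math
from itertools import combinations

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def naive_similarity_join(points, eps):
    """Return all index pairs (i, j), i < j, whose distance is <= eps."""
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if euclidean(p, q) <= eps
    ]

# Hypothetical toy data set: four 2-dimensional points.
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 4.0), (3.0, 4.5)]
pairs = naive_similarity_join(points, eps=1.0)
print(pairs)
```

With the threshold set to 1.0, only the two pairs of nearby points qualify; every other pair is farther apart than the threshold.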
We address the similarity join problem on massive high-dimensional data using the MapReduce framework and make the following contributions:
Summary
With the development of data acquisition technology and equipment, data size, precision, and dimensionality are increasing at an unprecedented rate. Many types of data, such as images, video, trajectories, and time series, can have thousands or even tens of thousands of dimensions. Three novel approaches based on a projection scheme are proposed to handle the similarity join problem on massive high-dimensional data efficiently: the Single Projection method, the Multiple Projection method, and the Projection Space Partitioning method.
Y. Ma et al.: Projection Based Large Scale High-Dimensional Data Similarity Join Using MapReduce Framework
[...] join, and analyzes their advantages and disadvantages. The third section gives the definition of high-dimensional data similarity join, introduces the relevant background, and proves the relevant theorems. Sections 4, 5, and 6 propose three novel similarity join algorithms, respectively: the Single Projection based Similarity Join Algorithm Using MapReduce, the Multiple Projections based Similarity Join Algorithm Using MapReduce, and the Projection Space Partitioning based Similarity Join Algorithm Using MapReduce. Section 7 presents comprehensive experiments. Section 8 presents conclusions and expectations for future work.