Abstract

We consider the problem of similarity search over large datasets in a distributed environment. The proposed framework integrates the VP-Tree algorithm on top of the MapReduce framework to achieve good performance while meeting the scalability and fault-tolerance requirements of the system as the data scales up. Because the VP-Tree algorithm was originally designed for partitioning and searching data on a local disk, we propose a new approach for using it in a parallel environment. The key property of the VP-Tree algorithm is that it distributes similar data points into groups, thereby reducing the number of data points that must be scanned during the search stage. Consequently, the response time of the entire system improves. In addition, we use the open-source computer vision library OpenCV to detect similarity among images in the dataset. We evaluate the performance of the proposed framework on a synthetic dataset to demonstrate the benefits of our approach. The experiments show that the proposed framework achieves a 57% improvement in response time compared with running the search job on the traditional Hadoop framework. We also compare the running time of our application in Docker containers against a VM-based environment. The results indicate that deploying our system in Docker containers provides better performance than the VM-based environment in terms of response time.
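To illustrate how grouping similar points reduces the number of points scanned at query time, the following is a minimal single-machine sketch of a vantage-point tree. It is not the framework described above: it assumes points are numeric tuples compared by Euclidean distance, it leaves out the MapReduce distribution and the OpenCV image features entirely, and the names `VPNode`, `build`, and `nearest` are hypothetical, chosen only for this example.

```python
import random


class VPNode:
    """One tree node: a vantage point plus the median distance that splits the rest."""

    def __init__(self, point, radius, inside, outside):
        self.point = point      # vantage point stored at this node
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # subtree of points with distance <= radius
        self.outside = outside  # subtree of points with distance > radius


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def build(points, dist=euclidean):
    """Recursively partition points around a vantage point (here: the first point)."""
    if not points:
        return None
    vantage, rest = points[0], points[1:]
    if not rest:
        return VPNode(vantage, 0.0, None, None)
    dists = [dist(vantage, p) for p in rest]
    radius = sorted(dists)[len(dists) // 2]
    inside = [p for p, d in zip(rest, dists) if d <= radius]
    outside = [p for p, d in zip(rest, dists) if d > radius]
    return VPNode(vantage, radius, build(inside, dist), build(outside, dist))


def nearest(node, query, dist=euclidean, best=None):
    """Return (distance, point) of the nearest neighbour, skipping subtrees
    that the triangle inequality shows cannot hold a closer point."""
    if node is None:
        return best
    d = dist(query, node.point)
    if best is None or d < best[0]:
        best = (d, node.point)
    if d <= node.radius:
        # Query lies in the inner ball: search it first.
        best = nearest(node.inside, query, dist, best)
        # Outer points are at least (radius - d) away from the query.
        if d + best[0] > node.radius:
            best = nearest(node.outside, query, dist, best)
    else:
        # Query lies outside the ball: search the outer group first.
        best = nearest(node.outside, query, dist, best)
        # Inner points are at least (d - radius) away from the query.
        if d - best[0] <= node.radius:
            best = nearest(node.inside, query, dist, best)
    return best


# Usage: index random 2-D points and look up the neighbour of one query.
random.seed(0)
sample_points = [(random.random(), random.random()) for _ in range(200)]
sample_tree = build(sample_points)
sample_distance, sample_match = nearest(sample_tree, (0.5, 0.5))
```

The search visits a subtree only when it could still contain a point closer than the best match found so far, which is the source of the reduced scan count. In a distributed setting the partitions would be spread across workers; how that is done is specific to the paper and is not shown here.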
