Abstract

Earth Mover's Distance (EMD) measures the similarity between probability distributions and is known as a robust measure that is more consistent with human similarity perception than traditional similarity functions. An EMD similarity join retrieves pairs of probability distributions whose EMD is below a specified threshold, and it supports many important applications, such as duplicate image retrieval and sensor pattern recognition. This paper studies the use of MapReduce to improve the scalability of EMD similarity join. Utilizing the dual-program mapping technique, we present a new general data partitioning framework that facilitates effective workload decomposition in MapReduce, ensuring that distributions that are similar in terms of EMD are mapped to the same reduce task for further verification. New optimization strategies are also proposed to balance the workloads among reduce tasks and to eliminate a large number of unnecessary EMD evaluations. Our experiments verify the superiority of our proposal in system efficiency, with an advantage of at least one order of magnitude over the state-of-the-art solution, and in system effectiveness, through a real case study of the abused image phenomenon on C2C websites in China. Further details are reported in [4].
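To make the join semantics concrete, the following is a minimal single-machine sketch, not the MapReduce framework of the paper. It assumes one-dimensional histograms over the same ordered bins with ground distance |i - j| between bins i and j, in which case EMD reduces to the sum of absolute differences between the two cumulative distributions. The function names `emd_1d` and `emd_similarity_join` are hypothetical and chosen only for illustration; the brute-force pairwise loop is exactly the cost that partitioning and pruning aim to avoid.

```python
from itertools import accumulate, combinations


def emd_1d(p, q):
    """EMD between two histograms on the same ordered bins.

    Assumption: ground distance between bins i and j is |i - j| and both
    histograms have equal total mass. Under that assumption the EMD equals
    the sum of absolute differences between the cumulative sums.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


def emd_similarity_join(histograms, threshold):
    """Return index pairs (i, j), i < j, whose EMD is below the threshold.

    Brute force: every pair is verified, so the cost is quadratic in the
    number of histograms.
    """
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(histograms), 2)
        if emd_1d(p, q) < threshold
    ]


if __name__ == "__main__":
    data = [
        [0.5, 0.5, 0.0],
        [0.5, 0.25, 0.25],
        [0.0, 0.0, 1.0],
    ]
    print(emd_similarity_join(data, threshold=0.5))
```

For general ground distances, computing EMD requires solving a transportation problem, which is far more expensive than this closed form and is the reason avoiding unnecessary evaluations matters.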
