Abstract

Distributed computing has been likened to the industrial revolution. Its transformational nature is evident in significant applications such as Internet of Things (IoT) operations. Entity resolution (ER) is the problem of matching and resolving records that represent the same real-world entity. It is a long-standing challenge in distributed databases, information retrieval, and statistics. In a centralized approach, ER does not scale well because large amounts of data must be sent to a central node. In this paper, we present an algorithm that handles heterogeneous distributed probabilistic data (HDPD) and reduces processing time in a distributed environment. Our approach has two parts. First, we treat the task as a matching (identification) problem and integrate different data models using the expectation–maximization (EM) algorithm. Second, we apply an ER methodology to HDPD to achieve major gains in response time. We validate the HDPD approach through experiments on a 100-node cluster, which show significant performance improvements over naive approaches. This paper is expected to provide insights into database organization and into new technological developments for the growth of distributed environments.
