Abstract

Record Linkage (RL) is the task of identifying duplicate entities within a dataset or across multiple datasets. In the era of Big Data, this task has attracted considerable attention because its complexity is intrinsically quadratic in the size of the dataset. In practice, the task can be outsourced to a cloud service, so a service customer may want to estimate the cost of a record linkage solution before executing it. Because the execution time of a record linkage solution depends on the combination of algorithms used, their parameter values, and the cloud infrastructure employed, estimating infrastructure costs a priori is hard. Beyond estimating customer costs, such estimates are also needed to evaluate whether a given set of RL parameter values will satisfy predefined time and budget restrictions. To tackle these challenges, we propose a theoretical model for estimating RL costs that takes into account the main steps that may influence the execution time of the RL task. We also propose an algorithm, denoted TBF, for evaluating the feasibility of RL parameter values given a set of predefined customer restrictions. We evaluate the efficacy of the proposed model, combined with regression techniques, using record linkage results obtained in real distributed environments. The experimental results show that the choice of regression technique has a significant influence on the estimated record linkage costs. Moreover, we conclude that which regression technique is most suitable for estimating record linkage costs depends on the evaluated scenario.
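
To make the quadratic growth and the idea of a time and budget check concrete, the following is a minimal sketch in Python. It is not the theoretical model or the TBF algorithm proposed in the paper, whose details are not given in this abstract. The function names, the fixed per-comparison time, the assumption of perfect parallel speed-up, and the per-worker hourly price are all hypothetical simplifications introduced for illustration.

```python
def pairwise_comparisons(n: int) -> int:
    """Candidate pairs when every record is compared with every other one."""
    return n * (n - 1) // 2


def estimate_cost(n: int, seconds_per_comparison: float,
                  workers: int, price_per_worker_hour: float):
    """Return (hours, cost) for a naive all-pairs linkage.

    Assumption (hypothetical): work is split evenly across workers with
    no overhead, and every comparison takes the same time.
    """
    total_seconds = pairwise_comparisons(n) * seconds_per_comparison
    hours = total_seconds / workers / 3600
    cost = hours * workers * price_per_worker_hour
    return hours, cost


def is_feasible(hours: float, cost: float,
                max_hours: float, max_budget: float) -> bool:
    """Check an estimate against predefined time and budget restrictions."""
    return hours <= max_hours and cost <= max_budget
```

Doubling the number of records roughly quadruples the number of comparisons: `pairwise_comparisons(1000)` gives 499500 and `pairwise_comparisons(2000)` gives 1999000. This is why an estimate made before execution matters to a cloud customer.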
