Abstract

Record linkage is a challenging task for Big Data. This paper therefore attempts to shed light on record linkage approaches for Big Data by comparing them along three dimensions: record linkage phases, dataset properties, and the parallel processing approach for Big Data. Existing state-of-the-art work has only conducted comparative studies between record linkage approaches, and only one comparative study has explored the whole record linkage framework for relational databases. The focus of the present study on the dimensions of parallel processing approaches for Big Data and dataset properties was therefore believed to be worth exploring. The study found the following. First, data exploration was an almost non-existent phase, despite its importance for understanding the dataset being examined. Second, techniques that handle the data standardization and preparation phase of the first dimension were not extensively covered in the literature, which can directly affect the quality of the results. Third, record linkage in unstructured data had not yet been explored in the literature. Fourth, MapReduce was used in about 50% of the selected studies to handle the parallel processing of Big Data, but due to its limitations, more recent and efficient approaches have been used, such as Apache Spark and Apache Flink. Apache Spark has only recently been adopted to resolve duplicates, owing to its support for in-memory computation, which makes the whole linkage process more efficient. Although the comparative study includes many recent studies that use Apache Spark, its adoption for record linkage is not yet well explored in the literature, and more research is needed. In addition, Apache Flink is still rarely used to solve the record linkage problem for Big Data. Fifth, pruning techniques, which eliminate unnecessary comparisons, are not adequately applied in the covered studies, despite their effect on reducing the search space and thereby making the record linkage process more effective.
