Abstract

In the Big Data era, Entity Resolution (ER) faces many challenges, such as the need for high scalability, the coexistence of complex similarity metrics, tautonymy and synonymy, and the need for data quality evaluation. Moreover, despite more than seventy years of development effort, there is still a high demand for democratizing ER by reducing human participation in parameter tuning, data labeling, defining blocking functions, and feature engineering. This study aimed to explore a novel Stacked Dedupe Learning ER system with high accuracy and efficiency. The study evaluated sophisticated composition methods, such as Bidirectional Recurrent Neural Networks (BiRNNs) with Long Short-Term Memory (LSTM) hidden units, to convert each tuple into a distributed representation that captures similarities between tuples. Where pre-trained word embeddings were not available, ways to learn and tune distributed word representations customized for ER tasks under different scenarios were also considered. In addition, a Locality Sensitive Hashing (LSH) based blocking approach was assessed; it considers all attributes of a tuple and produces smaller blocks than traditional methods, which use only a few attributes. The algorithm was tested on multiple datasets, namely benchmark and multilingual data. The experimental results showed that Stacked Dedupe Learning achieves high quality and good performance, and scales well compared with existing solutions.
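
The LSH-based blocking idea mentioned above can be illustrated with a minimal sketch. It assumes each tuple has already been turned into a single vector covering all of its attributes (for example by the composition methods named in the abstract), and it uses random-hyperplane LSH as a stand-in, since the abstract does not say which LSH family the study used. The function names (`make_hyperplanes`, `signature`, `lsh_blocks`, `candidate_pairs`) and the parameters `n_bits` and `seed` are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict
from itertools import combinations


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (Gaussian normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def lsh_blocks(vectors, n_bits=4, seed=0):
    """Group tuple ids by LSH signature; each distinct signature is one block.

    `vectors` maps a tuple id to its (assumed precomputed) embedding vector.
    """
    dim = len(next(iter(vectors.values())))
    planes = make_hyperplanes(dim, n_bits, seed)
    blocks = defaultdict(list)
    for tuple_id, vec in vectors.items():
        blocks[signature(vec, planes)].append(tuple_id)
    return dict(blocks)


def candidate_pairs(blocks):
    """Only tuples that share a block are compared in the matching step."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

In this sketch, raising `n_bits` yields more and smaller blocks, which reduces the number of candidate pairs at the risk of separating true matches. Practical LSH blocking usually repeats the hashing over several independent tables to recover such misses.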
