Abstract

Despite 70+ years of effort in all aspects of entity resolution (ER), there is still a high demand for democratizing ER by reducing the heavy human involvement in labeling data, performing feature engineering, tuning parameters, and defining blocking functions. With the recent advances in deep learning, in particular distributed representations of words (a.k.a. word embeddings), we present a novel ER system, called DeepER, that achieves good accuracy, high efficiency, as well as ease of use (i.e., much less human effort). We use sophisticated composition methods, namely uni- and bi-directional recurrent neural networks (RNNs) with long short-term memory (LSTM) hidden units, to convert each tuple to a distributed representation (i.e., a vector), which can in turn be used to effectively capture similarities between tuples. We consider both the case where pre-trained word embeddings are available as well as the case where they are not; we present ways to learn and tune the distributed representations that are customized for a specific ER task under different scenarios. We propose a locality sensitive hashing (LSH) based blocking approach that takes all attributes of a tuple into consideration and produces much smaller blocks, compared with traditional methods that consider only a few attributes. We evaluate our algorithms on multiple datasets (including benchmarks, biomedical data, as well as multi-lingual data), and the extensive experimental results show that DeepER outperforms existing solutions.
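To make the two core ideas concrete, the sketch below illustrates one plausible reading of the pipeline: a bidirectional LSTM composes a tuple's tokens into a single vector, and random-hyperplane LSH over those vectors groups tuples into blocks so that only same-block pairs are compared. This is a minimal illustration, not DeepER's actual implementation; the vocabulary, dimensions, and helper names (`tuple_to_vector`, `lsh_key`) are hypothetical, and the randomly initialized `nn.Embedding` stands in for pre-trained word embeddings.

```python
# Minimal sketch (assumed details, not the paper's code): BiLSTM tuple
# composition + random-hyperplane LSH blocking.
import numpy as np
import torch
import torch.nn as nn

EMB_DIM, HID_DIM, N_BITS = 50, 64, 16          # assumed sizes
vocab = {"apple": 0, "iphone": 1, "13": 2, "pro": 3, "smartphone": 4}
embedding = nn.Embedding(len(vocab), EMB_DIM)  # stand-in for pre-trained vectors
lstm = nn.LSTM(EMB_DIM, HID_DIM, bidirectional=True, batch_first=True)

def tuple_to_vector(tokens):
    """Concatenate the tuple's attribute tokens, run them through the
    BiLSTM, and use the final hidden states of both directions as the
    tuple's distributed representation."""
    ids = torch.tensor([[vocab[t] for t in tokens if t in vocab]])
    _, (h_n, _) = lstm(embedding(ids))
    return torch.cat([h_n[0, 0], h_n[1, 0]]).detach().numpy()

# Random-hyperplane LSH: each hyperplane contributes one signature bit,
# and tuples sharing a signature fall into the same block.
rng = np.random.default_rng(0)
planes = rng.standard_normal((N_BITS, 2 * HID_DIM))

def lsh_key(vec):
    return tuple(bool(b) for b in (planes @ vec) > 0)

blocks = {}
for rec in (["apple", "iphone", "13"], ["iphone", "13", "pro"], ["smartphone"]):
    blocks.setdefault(lsh_key(tuple_to_vector(rec)), []).append(rec)
print(blocks)
```

Note that, unlike traditional blocking keys built from one or two chosen attributes, the signature here is computed from the vector of the whole tuple, so every attribute influences which block a record lands in.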
