Abstract
Nowadays, data integration must often manage noisy data that also contain attribute values written in natural language, such as product descriptions or book reviews. In the data integration process, Entity Linkage has the role of identifying records that contain information referring to the same object. In order to reduce the dimension of the problem, modern Entity Linkage methods partition the initial search space into “blocks” of records that can be considered similar according to some metric, and then compare only the records belonging to the same block, thus greatly reducing the overall complexity of the algorithm. In this paper, we propose two automatic blocking strategies that, differently from traditional methods, aim at capturing the semantic properties of data by means of recent deep learning frameworks. Both methods, in a first phase, exploit recent research on tuple and sentence embeddings to transform the database records into real-valued vectors; in a second phase, to arrange the tuples inside the blocks, one of them adopts approximate nearest neighbour algorithms, while the other uses dimensionality reduction techniques combined with clustering algorithms. We train our blocking models on an external, independent corpus and then apply them directly to new datasets in an unsupervised fashion. This choice is motivated by the fact that, in most data integration scenarios, no training data are actually available. We tested our systems on six popular datasets and compared their performance against five traditional blocking algorithms. The results demonstrate that our deep-learning-based blocking solutions outperform standard blocking algorithms, especially on textual and noisy data.
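To make the two-phase strategy more concrete, the sketch below (not the paper's actual implementation) embeds record strings and then groups each record with its nearest neighbours to form candidate blocks. The `embed_records` helper, the TF-IDF stand-in for the pretrained tuple/sentence embedding models, and the use of scikit-learn's exact NearestNeighbors in place of a true approximate index (e.g. FAISS or Annoy) are assumptions made purely for illustration.

```python
# Minimal sketch of the two-phase blocking pipeline described in the abstract.
# Assumptions (not from the paper): records are plain strings, embed_records is
# a placeholder for a pretrained tuple/sentence embedding model, and the exact
# NearestNeighbors search stands in for an approximate nearest-neighbour index.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def embed_records(records):
    """Phase 1: map each record to a real-valued vector.
    TF-IDF is used here only as a stand-in for a deep embedding model."""
    return TfidfVectorizer().fit_transform(records).toarray()

def build_blocks(records, k=3):
    """Phase 2: place each record in a block with its k nearest neighbours."""
    vectors = embed_records(records)
    nn = NearestNeighbors(n_neighbors=min(k, len(records)), metric="cosine")
    nn.fit(vectors)
    _, neighbour_ids = nn.kneighbors(vectors)
    # Each row of neighbour_ids is a candidate block: only records that share
    # a block are compared in the subsequent entity-linkage step.
    return [{int(i) for i in row} for row in neighbour_ids]

records = [
    "Apple iPhone 13 128GB black smartphone",
    "iPhone 13, 128 GB, colour: black",
    "Samsung Galaxy S21 5G 256GB",
]
print(build_blocks(records, k=2))
```

In the actual systems, the embedding phase would rely on the pretrained deep models discussed in the paper, and the neighbour search would use an approximate index so that blocking scales to large datasets.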
Highlights
The integration of data coming from different sources is today of paramount importance: companies, hospitals, government agencies, banks and many other actors, in order to carry out their everyday activities, need to merge several datasets, e.g. customer databases or patient and pathology records. Integrating data in these scenarios may be relatively simple, especially when the data sources have clean and standard attributes, but with the increased use of internet-based services such as e-commerce, product comparison websites or online libraries, data integration is becoming more challenging.
We present the two dimensionality reduction techniques used in our methodology: principal component analysis (PCA) and t-distributed stochastic neighbour embedding (t-SNE); an illustrative sketch of this reduce-then-cluster step follows these highlights
We first provide the results of the tests between our blocking systems and the traditional methods, and we investigate the impact of different architectural choices of our model
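As a rough, self-contained illustration of the reduce-then-cluster strategy referenced in the highlights above, the sketch below applies PCA followed by t-SNE to a set of tuple embeddings and then clusters the projected points so that each cluster plays the role of a block. All parameter values (PCA dimensionality, perplexity, number of clusters) and the use of k-means are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch of the second strategy: dimensionality reduction (PCA, then
# t-SNE) followed by clustering to form blocks. Parameters are illustrative
# assumptions, not the values used in the paper.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

def blocks_via_reduction_and_clustering(vectors, n_blocks=10, pca_dim=50):
    """Reduce the embedding space, then cluster; each cluster is a block."""
    vectors = np.asarray(vectors)
    # PCA first, to denoise and speed up t-SNE on high-dimensional embeddings.
    reduced = PCA(n_components=min(pca_dim, vectors.shape[1])).fit_transform(vectors)
    # t-SNE projects the records into a low-dimensional space where
    # semantically similar records end up close to each other.
    projected = TSNE(n_components=2, perplexity=min(30, len(vectors) - 1),
                     init="pca", random_state=0).fit_transform(reduced)
    # Records sharing a cluster label belong to the same block.
    return KMeans(n_clusters=n_blocks, n_init=10, random_state=0).fit_predict(projected)

# Example: 200 random 300-dimensional "tuple embeddings".
rng = np.random.default_rng(0)
print(blocks_via_reduction_and_clustering(rng.normal(size=(200, 300)), n_blocks=5))
```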
Summary
The integration of data coming from different sources is today of paramount importance: companies, hospitals, government agencies, banks and many other actors, in order to carry out their everyday activities, need to merge several datasets, e.g. customer databases or patient and pathology records. Integrating data in these scenarios may be relatively simple, especially when the data sources have clean and standard attributes, but with the increased use of internet-based services such as e-commerce, product comparison websites or online libraries, data integration is becoming more challenging. The current disruptive growth in dataset sizes makes the problem intractable, since, when the number and the sizes of the datasets grow, the number of record pairs to compare quickly becomes prohibitive.