Abstract

In big data sources, real-world entities are typically represented with a variety of schemata and formats (e.g., relational records, JSON objects, etc.). Different profiles (i.e., representations) of an entity often contain redundant and/or inconsistent information. Thus identifying which profiles refer to the same entity is a fundamental task (called Entity Resolution) to unleash the value of big data. The naïve all-pairs comparison solution is impractical on large data, hence blocking methods are employed to partition a profile collection into (possibly overlapping) blocks and limit the comparisons to profiles that appear in the same block together. Meta-blocking is the task of restructuring a block collection, removing superfluous comparisons. Existing meta-blocking approaches rely exclusively on schema-agnostic features, under the assumption that handling the schema variety of big data does not pay-off for such a task. In this paper, we demonstrate how “loose” schema information (i.e., statistics collected directly from the data) can be exploited to enhance the quality of the blocks in a holistic loosely schema-aware (meta-)blocking approach that can be used to speed up your favorite Entity Resolution algorithm. We call it Blast (Blocking with Loosely-Aware Schema Techniques). We show how Blast can automatically extract the loose schema information by adopting an LSH-based step for efficiently handling volume and schema heterogeneity of the data. Furthermore, we introduce a novel meta-blocking algorithm that can be employed to efficiently execute Blast on MapReduce-like systems (such as Apache Spark). Finally, we experimentally demonstrate, on real-world datasets, how Blast outperforms the state-of-the-art (meta-)blocking approaches.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.