Abstract

In entity resolution, blocking pre-partitions data for further processing by more expensive methods. Two entity mentions are in the same block if they share identical or related blocking keys. Previous work has sometimes related blocking keys by grouping or alphabetically sorting them, but, as was shown for author disambiguation, the respective equivalences or total orders are not necessarily well suited to model the logical matching relation between blocking keys. To address this, we present a novel blocking approach that exploits the subset partial order over entity representations to build a matching-based bipartite graph, using connected components as blocks. To prevent over- and underconnectedness, we allow specification of overly general and generalization of overly specific representations. To build the bipartite graph, we contribute a new parallelized algorithm with configurable time/space tradeoff for minimal element search in the subset partial order. As a job-based approach, it combines dynamic scalability and easier integration, which makes it more convenient than the previously described approaches. Experiments on large gold standards for publication records, author mentions, and affiliation strings suggest that our approach is competitive in performance and allows better addressing of domain-specific problems. For duplicate detection and author disambiguation, our method offers the expected performance as defined by the vector-similarity baseline used in another work on the same dataset and by the common surname, first-initial baseline. For top-level institution resolution, we have reproduced the challenges described in prior work, strengthening the conclusion that for affiliation data, overlapping blocks under minimal elements are more suitable than connected components.
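
The blocking idea summarized above can be illustrated with a small sketch. It is one reading of the abstract, not the authors' implementation: it assumes each mention is represented as a set of features, treats a representation as matching every representation it is a subset of, and replaces the parallelized minimal element search with a naive quadratic scan. The names `minimal_elements` and `blocks_by_components` and the sample data are hypothetical.

```python
from collections import defaultdict


def minimal_elements(reps):
    """Return the representations that have no proper subset among reps.

    Naive O(n^2) scan; a stand-in for the paper's parallelized search.
    """
    unique = set(reps)
    return {r for r in unique if not any(o < r for o in unique)}


def blocks_by_components(mentions):
    """Group mentions into blocks via connected components.

    mentions: dict mapping a mention id to a frozenset of features.
    Each representation is linked to every minimal element that is a
    subset of it (the bipartite graph); components are found with
    union-find over the minimal elements.
    """
    reps = set(mentions.values())
    minima = minimal_elements(reps)
    parent = {m: m for m in minima}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    rep_to_min = {}
    for r in reps:
        # Every representation has at least one minimal element below it
        # (itself, if it is minimal).
        below = [m for m in minima if m <= r]
        rep_to_min[r] = below[0]
        for m in below[1:]:
            union(below[0], m)  # r bridges these minimal elements

    blocks = defaultdict(set)
    for mention_id, r in mentions.items():
        blocks[find(rep_to_min[r])].add(mention_id)
    return sorted(sorted(b) for b in blocks.values())


# Hypothetical author mentions, represented as feature sets.
mentions = {
    "a1": frozenset({"smith", "j"}),
    "a2": frozenset({"smith", "j", "john"}),
    "a3": frozenset({"smith", "j", "r"}),
    "a4": frozenset({"miller", "a"}),
    "a5": frozenset({"miller", "a", "anna"}),
}
blocks = blocks_by_components(mentions)
```

In this toy data, the more specific representations fall under the general ones `{"smith", "j"}` and `{"miller", "a"}`, so two blocks result. A very general representation, such as an empty feature set, would sit below everything and merge all blocks, which is the kind of overconnectedness the abstract mentions.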
