Abstract

In this paper, we introduce AttendEM, a framework for entity matching (EM), i.e., the pairwise identification of duplicates across databases. Whereas other transformer-based EM solutions focus on text cleaning and training data augmentation, AttendEM enhances the base transformer architecture through intra-transformer ensembling of distinctively rearranged text, additional aggregator tokens, and extra self-attention. On the ER-Magellan benchmark datasets, AttendEM achieved higher F1 scores than state-of-the-art (SOTA) solutions in most cases. Its mean improvement is 0.21% over Ditto’s own reported results, 3.93% over DAEM’s replication of Ditto, 2.99% over HierGAT’s replication of Ditto, 0.53% over DAEM, and 0.54% over HierGAT. These improvements are comparable to those of the solutions that claimed to outperform Ditto, when each is measured against its own Ditto replication: HierGAT (Yao et al., 2022) improved by 2.46%, compared with AttendEM’s 2.99%, and DAEM (Huang et al., 2022) improved by 3.42%, compared with AttendEM’s 3.93%.
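
For readers unfamiliar with the metric, the following is a minimal sketch of how an F1 score can be computed for pairwise EM predictions, assuming each candidate record pair carries a binary label (1 for a match, 0 for a non-match). The function name `pairwise_f1` and the toy labels are illustrative assumptions, not part of AttendEM or the benchmark tooling.

```python
def pairwise_f1(y_true, y_pred):
    """F1 of the positive (match) class over candidate record pairs.

    Hypothetical helper for illustration; labels are 1 (match) or 0 (non-match).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true matches found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false matches
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed matches
    if tp == 0:
        return 0.0  # no correct matches: F1 is zero by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: six candidate pairs, made up for illustration only.
labels = [1, 0, 1, 1, 0, 0]
preds = [1, 0, 0, 1, 1, 0]
score = pairwise_f1(labels, preds)
```

Because EM datasets typically contain far more non-matching pairs than matching ones, F1 on the match class is a more informative measure than accuracy.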
