Abstract

Entity Matching (EM) aims to determine whether records in two datasets refer to the same real-world entity. Existing work often uses Pre-trained Language Models (PLMs) for feature representation, casting EM as a binary classification task. However, because PLMs depend on large labeled datasets and current EM benchmarks exhibit overlap between train and test sets, these methods often underperform in real-world scenarios (e.g., small data sizes, hard negative samples, and unseen entities). To address these limitations, we propose SETEM, a self-ensemble training method that leverages the stability and strong generalization of ensemble models to tackle these challenges in real-world scenarios. Additionally, we develop a faster training method for low-resource applications. Experiments on benchmark datasets show that SETEM outperforms Ditto and HierGAT on F1 score. In particular, SETEM shows the greatest advantage on small datasets and on test sets with a high proportion of unseen entities, achieving up to a 9.61% F1 gain over baselines on the WDC dataset.
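The PLM-based framing the abstract describes — serialize a pair of records into one sequence and classify it as match or non-match — can be sketched as follows. The attribute names and the COL/VAL format are illustrative, following the Ditto-style serialization commonly used in this line of work; the classifier itself is omitted:

```python
# Hypothetical sketch of pair serialization for PLM-based entity matching.
# Record attributes are flattened into "COL <name> VAL <value>" segments,
# and the two records are joined with [SEP] for a binary classifier.

def serialize(record: dict) -> str:
    """Flatten a record's attributes into a single token sequence."""
    return " ".join(f"COL {k} VAL {v}" for k, v in record.items())

def make_pair_input(left: dict, right: dict) -> str:
    """Build the [SEP]-joined sequence a PLM classifier would consume."""
    return f"{serialize(left)} [SEP] {serialize(right)}"

a = {"title": "iPhone 13 128GB", "brand": "Apple"}
b = {"title": "Apple iPhone 13 (128 GB)", "brand": "Apple"}
print(make_pair_input(a, b))
```

A PLM (e.g., BERT) would then encode this sequence and predict match/non-match from the pooled representation; SETEM's contribution, per the abstract, is the self-ensemble training scheme around such a classifier rather than the serialization itself.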
