Deep entity matching with pre-trained language models

Yuliang Li, Wang-Chiew Tan, Anhai Doan, Yoshihiko Suhara, Jinfeng Li

doi:10.14778/3421424.3421431

Abstract

We present Ditto, a novel entity matching (EM) system based on pre-trained Transformer-based language models. We cast EM as a sequence-pair classification problem and fine-tune such models for it, which lets us leverage them with a simple architecture. Our experiments show that a straightforward application of language models such as BERT, DistilBERT, or RoBERTa, pre-trained on large text corpora, already significantly improves matching quality and outperforms the previous state of the art (SOTA) by up to 29% in F1 score on benchmark datasets. We also developed three optimization techniques to further improve Ditto's matching capability. First, Ditto allows domain knowledge to be injected by highlighting important pieces of input information that may be of interest when making matching decisions. Second, Ditto summarizes strings that are too long so that only the essential information is retained and used for EM. Third, Ditto adapts a SOTA data augmentation technique for text to EM, augmenting the training data with (difficult) examples. This way, Ditto is forced to learn "harder" to improve the model's matching capability. The optimizations we developed further boost the performance of Ditto by up to 9.8%. Perhaps more surprisingly, we establish that Ditto can achieve the previous SOTA results with at most half the amount of labeled data. Finally, we demonstrate Ditto's effectiveness on a real-world large-scale EM task. On matching two company datasets consisting of 789K and 412K records, Ditto achieves a high F1 score of 96.5%.
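
To make the sequence-pair formulation concrete, here is a minimal sketch of how two records might be flattened into one input for a sequence-pair classifier. The abstract does not specify the serialization format, so the function names, the `[COL]`/`[VAL]` attribute markers, and the sample records are illustrative assumptions; `[CLS]` and `[SEP]` follow the usual BERT-style convention, which a real tokenizer would normally add on its own.

```python
from typing import Dict


def serialize(record: Dict[str, str]) -> str:
    """Flatten one record into a string of attribute/value segments.

    The [COL]/[VAL] markers are an assumed convention for this sketch.
    """
    return " ".join(f"[COL] {attr} [VAL] {val}" for attr, val in record.items())


def to_sequence_pair(left: Dict[str, str], right: Dict[str, str]) -> str:
    """Join two serialized records into a single sequence-pair input.

    A fine-tuned language model would then classify this input as
    match / no-match; that step is not shown here.
    """
    return f"[CLS] {serialize(left)} [SEP] {serialize(right)} [SEP]"


# Hypothetical sample records, not taken from any benchmark dataset.
left = {"name": "Acme Corp", "city": "Springfield"}
right = {"name": "ACME Corporation", "city": "Springfield"}

pair = to_sequence_pair(left, right)
print(pair)
```

In practice, the two serialized strings would be passed to a tokenizer as a text pair and the resulting encoding fed to a binary classification head. The sketch only covers the input construction step.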

Journal: Proceedings of the VLDB Endowment
Publication Date: Sep 1, 2020
Citations: 182
