Abstract

Entity resolution, the task of identifying different representations of the same real-world entity, is a crucial part of data integration systems. Existing learning-based models can achieve good performance, but they depend heavily on the quantity and quality of training data. In this paper, we propose the MixER model to alleviate this dependence. MixER uses EMix, our newly designed data augmentation method. EMix maps discrete entity records to continuous latent-space variables (e.g., probability distributions) and then linearly interpolates entity records in the latent space to generate many augmented training samples. The matching model is then further optimized on the augmented data to strengthen its generalization capability. In the data sensitivity experiments, MixER shows a significant advantage when the amount of training data is below 50. In the robustness experiments, MixER shows an absolute performance advantage when the label noise exceeds 20%. In addition, ablation experiments demonstrate that EMix effectively improves the generalization ability of the matching model. Overall, the experimental results show that MixER exhibits better data sensitivity and robustness than current state-of-the-art methods.
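
The interpolation step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `interpolate` and `mix_pair`, the `alpha` parameter, and the choice to draw the mixing coefficient from a Beta(alpha, alpha) distribution (as in the standard mixup recipe) are all assumptions made for illustration. The abstract states only that entity records are mapped to continuous latent variables and linearly interpolated there. The sketch assumes each record pair has already been encoded as a latent vector, and that each label is a probability distribution over [non-match, match].

```python
import random


def interpolate(a, b, lam):
    """Element-wise linear interpolation: lam * a + (1 - lam) * b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def mix_pair(emb_a, label_a, emb_b, label_b, alpha=0.4, rng=None):
    """Build one augmented sample from two encoded training samples.

    emb_a, emb_b     -- latent vectors (hypothetical encoder output)
    label_a, label_b -- label distributions, e.g. [p_non_match, p_match]
    alpha            -- Beta distribution parameter (assumed, mixup-style)
    """
    rng = rng or random.Random()
    # Assumption: mixing coefficient sampled from Beta(alpha, alpha).
    lam = rng.betavariate(alpha, alpha)
    mixed_emb = interpolate(emb_a, emb_b, lam)
    # The label is mixed with the same coefficient as the latent vector.
    mixed_label = interpolate(label_a, label_b, lam)
    return mixed_emb, mixed_label, lam


if __name__ == "__main__":
    rng = random.Random(0)
    matching = ([0.9, 0.1, 0.3], [0.0, 1.0])      # toy latent vector, "match"
    non_matching = ([0.2, 0.8, 0.5], [1.0, 0.0])  # toy latent vector, "non-match"
    emb, label, lam = mix_pair(*matching, *non_matching, rng=rng)
    print("lambda:", lam)
    print("mixed embedding:", emb)
    print("mixed label:", label)
```

Because the label is interpolated with the same coefficient as the latent vector, each augmented sample carries a soft label that still sums to 1, so it can be used with a standard cross-entropy objective over probability distributions.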
