Abstract
Record linkage is essential in data integration, healthcare analytics, fraud detection, and other applications that require matching entities across datasets. Real-world data, however, is typically noisy (missing values, typographical errors, inconsistent formatting), and these factors significantly degrade the performance of both deterministic and probabilistic approaches. In this paper, we introduce a deep learning model combined with regularization techniques (dropout, weight decay, early stopping) to improve robustness in noisy record linkage. We evaluate the approach on open datasets that simulate real-world scenarios at varying noise levels, using data augmentation to inject synthetic but realistic input errors. Results show that regularization improves model performance in noisy settings, achieving up to 20% higher accuracy and recall than unregularized baselines. Dropout in particular tended to generalize better by limiting overfitting to noise. These findings demonstrate the potential of deep learning with regularization for record linkage in noisy environments, and motivate future work on additional techniques such as adversarial training and batch normalization.
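To make the abstract's three regularizers and the noise-injection augmentation concrete, the following is a minimal sketch, assuming a PyTorch implementation and a similarity-feature-vector representation of record pairs. The layer sizes, hyperparameters, and the `inject_typos` helper are illustrative assumptions, not the paper's actual architecture or code.

```python
# Sketch only: a pairwise record-linkage classifier with the three regularizers
# named in the abstract (dropout, weight decay, early stopping) plus a toy
# noise-injection augmenter. All names and hyperparameters are assumptions.
import random
import torch
import torch.nn as nn


def inject_typos(s: str, rate: float = 0.1) -> str:
    """Augmentation sketch: randomly drop characters to simulate input errors."""
    return "".join(c for c in s if random.random() > rate)


class LinkageClassifier(nn.Module):
    """Scores a record pair (given as a similarity feature vector) as match/non-match."""

    def __init__(self, n_features: int, hidden: int = 64, p_drop: float = 0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),      # dropout regularization
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, 1),    # logit for "records match"
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)


def train(model, train_loader, val_loader, epochs: int = 100, patience: int = 5):
    # Weight decay (L2 regularization) is applied through the optimizer.
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
    loss_fn = nn.BCEWithLogitsLoss()  # targets: float 0/1 match labels
    best_val, epochs_without_improvement = float("inf"), 0
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        # Early stopping: halt when validation loss stops improving.
        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(x), y).item() for x, y in val_loader)
        if val_loss < best_val:
            best_val, epochs_without_improvement = val_loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
```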