Pre-trained models for linking process in data washing machine

Bushra Sajid, Ahmed Abu-Halimeh, Nuh Jakoet

doi:10.59400/cai.v3i1.1450

Abstract

Entity resolution (ER) has been studied for decades across many domains as a fundamental task in data integration and data quality. The growing volume of heterogeneously structured and even unstructured data challenges traditional ER methods. This research focuses on the Data Washing Machine (DWM), which was developed under the NSF DART Data Life Cycle and Curation research theme and automatically detects and corrects certain types of data quality errors. It also performs unsupervised entity resolution to identify duplicate records. However, it relies on traditional methods driven by algorithmic pattern rules, such as Levenshtein edit distance and matrix comparators. The goal of this research is to assess whether replacing these rule-based methods with machine learning and deep learning methods improves the effectiveness of the DWM's processes, using 18 sample datasets. The DWM comprises several processes for improving data quality; this work focuses on the scoring and linking processes. To integrate a machine learning model into the DWM, several pre-trained models were tested to find one that produces accurate vectors for calculating the similarity between records. DistilRoBERTa was chosen to generate the embeddings, and cosine similarity was then used to compute the similarity scores. This made it possible to assess the machine learning model within the DWM, and its results were close to those produced by the existing scoring matrix. The model performed well overall, possibly because it captured the important features of the records, which helped the entity matching process.
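
As a rough illustration of the two scoring styles the abstract contrasts, the sketch below pairs a rule-based comparator (Levenshtein edit distance) with an embedding-based one (cosine similarity over record vectors). It is not the DWM's code. The names `toy_embed`, `link_pairs`, and the `threshold` value are hypothetical, and `toy_embed` is a simple token-count stand-in for a pre-trained encoder such as DistilRoBERTa, used here only so the example runs without downloading a model.

```python
import math
from collections import Counter
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Rule-based baseline: edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def toy_embed(records):
    """ASSUMPTION: stand-in encoder that builds token-count vectors.

    A pre-trained model would return dense embeddings instead; this
    placeholder only keeps the example self-contained.
    """
    tokenized = [r.lower().split() for r in records]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([float(counts[tok]) for tok in vocab])
    return vectors


def link_pairs(records, embed=toy_embed, threshold=0.8):
    """Score every record pair and keep those at or above the threshold.

    The threshold of 0.8 is an illustrative value, not one from the paper.
    """
    vectors = embed(records)
    links = []
    for i, j in combinations(range(len(records)), 2):
        score = cosine_similarity(vectors[i], vectors[j])
        if score >= threshold:
            links.append((i, j, score))
    return links


records = [
    "mary ann smith 12 oak st",
    "smith mary ann 12 oak st",
    "john doe 99 elm ave",
]
links = link_pairs(records)
```

In this toy run the first two records contain the same tokens in a different order, so their vectors coincide and they are linked, while a character-level edit distance between the same two strings is nonzero. To experiment with a real encoder, `embed` could be swapped for any function that maps a list of strings to a list of vectors, for example one built on a pre-trained sentence-embedding library; the exact checkpoint the authors used is not stated in this abstract.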

Journal: Computing and Artificial Intelligence
Publication Date: Nov 1, 2024
License type: CC BY 4.0

Similar Papers

Data Curation and Quality Evaluation for Machine Learning-Based Cyber Intrusion Detection
Ngan Tran ... Haihua Chen
IEEE Access | VOL. 10
01 Jan 2021

Automatic generation of conclusions from neuroradiology MRI reports through natural language processing.
Pilar López-Úbeda ... Jorge Escartín
Neuroradiology | VOL. 66
21 Feb 2024

A Survey on Blocking Technology of Entity Resolution
Bo-Han Li ... Shuo Wan
Journal of Computer Science and Technology | VOL. 35
01 Jul 2020

Cognitive decline assessment using semantic linguistic content and transformer deep learning architecture.
Rini Pl ... Gayathri Ks
International Journal of Language & Communication Disorders | VOL. 59
16 Nov 2023
