Entity Resolution (ER) has been investigated for decades across many domains as a fundamental task in data integration and data quality. The growing volume of heterogeneously structured, and even unstructured, data challenges traditional ER methods. This research focuses on the Data Washing Machine (DWM), developed under the NSF DART Data Life Cycle and Curation research theme, which automatically detects and corrects certain types of data quality errors and performs unsupervised entity resolution to identify duplicate records. However, the DWM relies on traditional methods driven by algorithmic pattern rules, such as Levenshtein edit distance and matrix comparators. The goal of this research is to assess replacing these rule-based methods with machine learning and deep learning methods, evaluating their effectiveness on 18 sample datasets. Of the DWM's several data quality processes, we currently focus on the scoring and linking processes. To integrate a machine learning model into the DWM, several pre-trained models were tested to find one that produces accurate vector embeddings for computing the similarity between records. After this comparison, DistilRoBERTa was chosen to generate the embeddings, and cosine similarity was used to compute the similarity scores. This allowed us to assess the machine learning model within the DWM, where it produced results close to those of the existing scoring matrix. The model performed well overall, likely because the embeddings capture the important features of each record and thereby aid the entity matching process.
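As a minimal sketch of the embedding-and-scoring step described above: the abstract does not specify the exact checkpoint, library, or match threshold, so the sentence-transformers `all-distilroberta-v1` model and the 0.8 threshold below are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch: embed record strings with a DistilRoBERTa-based model and
# score candidate duplicate pairs by cosine similarity.
# Assumptions (not from the abstract): the 'all-distilroberta-v1' checkpoint
# and the 0.8 threshold are placeholders for illustration.
from sentence_transformers import SentenceTransformer, util

def score_record_pairs(records, threshold=0.8):
    """Embed each record string and return pairs whose cosine similarity
    meets the threshold, as (index_i, index_j, score) tuples."""
    model = SentenceTransformer("all-distilroberta-v1")
    embeddings = model.encode(records, convert_to_tensor=True)
    # Pairwise cosine similarity matrix over all record embeddings.
    sims = util.cos_sim(embeddings, embeddings)
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = float(sims[i][j])
            if score >= threshold:
                pairs.append((i, j, score))  # candidate duplicate pair
    return pairs

# Example: two near-duplicate records and one distinct record.
records = [
    "John A. Smith, 123 Main St, Little Rock AR",
    "Jon Smith, 123 Main Street, Little Rock, AR",
    "Maria Garcia, 45 Oak Ave, Fayetteville AR",
]
for i, j, score in score_record_pairs(records):
    print(records[i], "<->", records[j], f"(cosine={score:.3f})")
```

In this framing, the cosine scores play the role that rule-based comparators such as Levenshtein edit distance play in the original DWM scoring process, with downstream linking logic left unchanged.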