Abstract

Detecting duplicate records is an essential step in preparing high-quality data for analysis. In statistics, the task is commonly called record linkage, but it goes by other names in other fields. This study applied unsupervised Random Forests to duplicate detection on the RLdata500 dataset. Unsupervised machine learning algorithms generally perform worse than supervised ones. However, supervised learning requires labelled training data, which most real-world datasets lack. Unsupervised learning therefore remains relevant, and its performance can be improved through proper data conditioning. This paper reports the outcome of such conditioning using binning and string encoding (ONCA and Soundex). The similarity weight obtained from the pairwise proximity matrix was used as the matching measure, and an optimal threshold served as the cut-off point for classifying each record pair as a match (duplicate detected) or a non-match. Because the duplicate records in this dataset are known, the conditions could be compared using recall, precision, and F-measure, with the optimal threshold chosen as the one that maximises the F-measure. Precision-recall graphs, together with an F-measure graph transformed to allow a fair comparison, were also used. The results showed that the best-performing conditions (2, 6 and 7) were those in which numeric values were treated as continuous. The two encoding methods performed equally well on this dataset and, together with binning, those conditions were able to detect more perfect matches. Future work will test and compare further datasets and methods.
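
The threshold selection described above can be illustrated with a minimal sketch. It assumes that a similarity weight is already available for each record pair and that the true match status is known; the function names and the toy values are hypothetical and are not taken from the study.

```python
from typing import List, Tuple


def precision_recall_f(
    scores: List[float], labels: List[bool], threshold: float
) -> Tuple[float, float, float]:
    """Classify pairs with score >= threshold as matches and evaluate them."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


def best_threshold(
    scores: List[float], labels: List[bool]
) -> Tuple[float, Tuple[float, float, float]]:
    """Try every observed score as a cut-off and keep the one with maximum F-measure."""
    candidates = sorted(set(scores))
    best = max(candidates, key=lambda t: precision_recall_f(scores, labels, t)[2])
    return best, precision_recall_f(scores, labels, best)


# Toy similarity weights for six record pairs (illustrative values only).
scores = [0.95, 0.90, 0.60, 0.40, 0.20, 0.10]
# Known match status of each pair (True = duplicate).
labels = [True, True, False, True, False, False]

threshold, (precision, recall, f_measure) = best_threshold(scores, labels)
print(threshold, precision, recall, round(f_measure, 3))
```

In this toy example, lowering the cut-off to 0.40 admits one false match but recovers every true duplicate, which gives the highest F-measure among the candidate thresholds.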
