Abstract

With the rapid development of industry, huge amounts of industrial data are generated every day. Part of this data is duplicated, which not only reduces data quality to a certain extent but also hinders enterprises from making correct decisions, thereby reducing productivity. To improve data quality, it is particularly important to clean similar duplicate records. However, when the SNM (sorted-neighborhood method) algorithm is used to detect similar records, every pair of records in the window must be compared, and both time efficiency and accuracy are low. To address this defect, this paper proposes an improved dynamic fault-tolerant algorithm based on effective weights. First, within the window, record pairs that cannot be duplicates are excluded according to the ratio of the two records' lengths, which reduces the number of record comparisons and thus improves detection efficiency. Second, by setting an attribute validity factor, a dynamic fault-tolerance algorithm is proposed to handle the misjudgments caused by missing attributes during detection, which not only improves the efficiency of duplicate checking but also ensures the accuracy of similar duplicate record detection. Finally, the experimental results show that, under the same experimental environment, the improved algorithm has clear advantages in both time efficiency and accuracy.
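
The two steps summarized in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything here is an assumption made for illustration rather than the paper's actual method: the function names, the use of `difflib.SequenceMatcher` as the per-attribute similarity, the renormalization of weights over attributes present in both records (standing in for the "validity factor"), and the values `window=3`, `min_len_ratio=0.6`, and `threshold=0.85`.

```python
from difflib import SequenceMatcher


def length_ratio(a, b):
    """Ratio of the shorter string length to the longer one (1.0 if both empty)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else min(len(a), len(b)) / longest


def record_text(rec):
    """Concatenate all non-missing attribute values of a record."""
    return "".join(v for v in rec.values() if v)


def weighted_similarity(r1, r2, weights):
    """Weighted attribute similarity that ignores attributes missing in either record.

    Assumption: a missing attribute gets validity 0, and the remaining
    weights are renormalized so they still sum to 1.
    """
    score, effective_total = 0.0, 0.0
    for attr, w in weights.items():
        v1, v2 = r1.get(attr), r2.get(attr)
        if not v1 or not v2:
            continue  # attribute is not valid for this pair
        score += w * SequenceMatcher(None, v1, v2).ratio()
        effective_total += w
    return score / effective_total if effective_total else 0.0


def detect_duplicates(records, weights, key, window=3,
                      min_len_ratio=0.6, threshold=0.85):
    """Sorted-neighborhood pass with a length-ratio pre-filter.

    Returns a list of (i, j) index pairs, i < j, judged to be duplicates.
    """
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        # Compare only with the previous (window - 1) records in sort order.
        for j in order[max(0, pos - window + 1):pos]:
            text_i, text_j = record_text(records[i]), record_text(records[j])
            if length_ratio(text_i, text_j) < min_len_ratio:
                continue  # lengths differ too much: skip the costly comparison
            if weighted_similarity(records[i], records[j], weights) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return pairs
```

In this sketch the length-ratio check is cheap compared with the string similarity, so pairs rejected by it never reach the expensive comparison. Renormalizing the weights means a record with a missing field is not penalized simply for lacking it.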

