Abstract

Repairing obsolete data items to their up-to-date values is a major challenge in improving data quality. Previous data repairing methods are based on either quality rules or statistical techniques, and both types of methods have limitations. To overcome these shortcomings, this paper combines quality rules and statistical techniques to improve data currency. (1) A new class of currency repairing rules (CRRs for short) is proposed to express both domain knowledge and statistical information. Domain knowledge is expressed by the rule pattern, and statistical information is described by the conditional probability distribution corresponding to each rule. (2) The problem of generating minimized CRRs is studied in both the static and the dynamic world. In the static world, the problem of generating minimized CRR patterns is proved to be NP-hard, and two approximate algorithms are provided to solve it. In the dynamic world, methods are provided to update the CRRs without recomputing the whole CRR set when the data changes. In some special cases, the updates can be finished in $O(1)$ time. In both worlds, methods for learning the conditional probabilities for each CRR pattern are provided. (3) Based on the CRRs, the problems of finding optimal repairing plans with and without a cost budget are studied, and methods are provided to solve them. (4) Experiments on both real and synthetic data sets show that the proposed methods are efficient and effective.
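To make the idea of pairing a rule pattern with a conditional probability distribution more concrete, the following is a minimal sketch under stated assumptions. It is not the paper's CRR definition or algorithm: the abstract does not give the rule syntax, so the pattern is simplified here to a tuple of condition attributes, the probabilities are estimated by plain relative frequency, and a repair picks the most probable value. The function names `learn_distribution` and `repair`, the attribute names, and the sample records are all hypothetical.

```python
from collections import Counter, defaultdict


def learn_distribution(history, condition_attrs, target_attr):
    """Estimate P(target | condition) by relative frequency.

    Assumption: a rule pattern is just a tuple of condition attributes;
    the actual CRR pattern language in the paper may be richer.
    """
    counts = defaultdict(Counter)
    for record in history:
        key = tuple(record[a] for a in condition_attrs)
        counts[key][record[target_attr]] += 1

    dist = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        dist[key] = {value: n / total for value, n in counter.items()}
    return dist


def repair(record, condition_attrs, target_attr, dist):
    """Return a copy of record with target_attr set to its most probable value.

    If no learned distribution matches the record, the copy is returned unchanged.
    """
    repaired = dict(record)
    key = tuple(record[a] for a in condition_attrs)
    if key in dist:
        repaired[target_attr] = max(dist[key], key=dist[key].get)
    return repaired


# Hypothetical training data: an observed (possibly stale) job title
# and the title that was later confirmed to be current.
history = [
    {"observed_title": "assistant", "current_title": "associate"},
    {"observed_title": "assistant", "current_title": "associate"},
    {"observed_title": "assistant", "current_title": "associate"},
    {"observed_title": "assistant", "current_title": "assistant"},
]

dist = learn_distribution(history, ("observed_title",), "current_title")
stale = {"observed_title": "assistant", "current_title": "assistant"}
print(repair(stale, ("observed_title",), "current_title", dist))
```

In this toy setting the stale record is repaired to the value seen in three of the four matching historical records. The sketch ignores rule minimization, incremental updates, and repair costs, all of which the abstract lists as separate contributions.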

