Abstract

In this paper, we propose a knowledge-based rule management system for data cleaning. The system combines features of rule based systems and rule based data cleaning frameworks, and it offers three main advantages. First, it proposes a strong, unified rule form based on first order structure that permits the representation and management of all types of rules, together with their quality, via a set of characteristics. Second, it increases the quality of the rules, which in turn determines the quality of data cleaning. Third, it uses an appropriate knowledge acquisition process, which is the weakest task in current rule and knowledge based systems. Because several research works have shown that data cleaning is driven by domain knowledge rather than by data, we have identified and analyzed the properties that distinguish knowledge and rules from data in order to better determine the main components of the proposed system. To illustrate the system, we also present a first experiment with a case study in the health sector, where we demonstrate how the system helps improve data quality. The autonomy, extensibility and platform independence of the proposed rule management system facilitate its incorporation into any system concerned with data quality management.

Highlights

  • Data quality (DQ) has always been an important issue, and it is even more so today

  • The Rule-Based approaches for DC (RBDC) proposed for both academic research and practical applications have persistent limitations related to rule design: no practical methodology for Rule Based Systems (RBS) is acceptable for RBDC systems, because these methodologies address only rule production and do not ensure the quality of the rules

  • The development of an appropriate RBS for Data Cleaning (DC) is crucial to the final success of RBDC: the rule representation should be expressive enough to capture all of the required rules, and it should make each rule and its quality easy to handle and manage

Introduction

Data quality (DQ) has always been an important issue, and it is even more so today. Research has examined the role of Data Cleaning (DC) tools in helping to improve DQ and has clarified the need for an enterprise-wide approach to DQ management, which is increasingly complex, open and dynamic [1,2]. There is a wide variety of DC tools. Their functionality can be classified as follows: Declarative DC and Rule-Based approaches for DC (RBDC). Although Rule Based Systems (RBS), which encode knowledge as rules and are used to process complicated tasks, have been firmly established for many years, they have not been formally and adequately addressed for DC tasks. As our objective is to enhance DQ by applying a rule based approach to DC, it is necessary to present some works related to Knowledge Based Systems, Rule Based Systems and Rule-Based Data Cleaning. Knowledge can be obtained from domain experts, raw data, documents, personal knowledge, business models and/or learning by experience [12,13].
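
To make the idea of encoding cleaning knowledge as rules more concrete, the following is a minimal Python sketch. It is not the rule form or the system proposed in this paper: the names `Rule` and `apply_rules`, the `source` and `confidence` fields standing in for rule quality characteristics, and the example rule itself are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


@dataclass
class Rule:
    """A cleaning rule: a condition over a record plus a repair action.

    The `source` and `confidence` fields are hypothetical stand-ins for
    the kind of quality characteristics a rule could carry.
    """
    name: str
    condition: Callable[[Record], bool]  # does the rule fire on this record?
    action: Callable[[Record], Record]   # repair returned when it fires
    source: str = "domain expert"        # assumed origin of the knowledge
    confidence: float = 1.0              # assumed quality score in [0, 1]


def apply_rules(record: Record, rules: List[Rule],
                min_confidence: float = 0.0) -> Record:
    """Apply, in order, every rule that meets the confidence threshold."""
    cleaned = dict(record)  # work on a copy so the input is not mutated
    for rule in rules:
        if rule.confidence >= min_confidence and rule.condition(cleaned):
            cleaned = rule.action(cleaned)
    return cleaned


# Hypothetical health-sector rule: a negative age is invalid, so blank it.
negative_age = Rule(
    name="negative_age",
    condition=lambda r: r.get("age") is not None and r["age"] < 0,
    action=lambda r: {**r, "age": None},
    confidence=0.9,
)

patient = {"name": "x", "age": -5}
cleaned_patient = apply_rules(patient, [negative_age])
```

In this sketch, raising `min_confidence` above a rule's score skips that rule, which is one simple way rule quality could influence which repairs are applied.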

