Data Quality Tools for Data Warehousing: Enterprise Case Study

Majid Zaman, Muheet Ahmedbutt

doi:10.5120/10117-4788

Abstract

Various data quality tools are used to ensure data quality in an enterprise data repository. The scope of these tools is moving from specific applications to a more global perspective, so as to ensure data quality at every level. A more organized framework is needed to help managers choose these tools so that data repositories or data warehouses can be maintained efficiently. Data quality tools are used in data warehousing to prepare the data and ensure that clean data populates the warehouse, thus enhancing the usability of the warehouse. This research focuses on the various data quality tools that have been used and implemented successfully in preparing the examination data of the University of Kashmir for the preparation of results. This paper also proposes a mapping of data quality tools to the processes involved in efficient data migration to a data warehouse.

I. INTRODUCTION

Data quality has two distinct aspects: one is the correctness of data (such as accuracy and consistency), and the other is the appropriateness of data for some intended purpose. Data producers and users generally assume that the purpose of data quality assurance is to provide the best data possible. However, this obscures the need to evaluate data. The implication is that if a data set is the best available and is as good as it can be made, then there is no option other than to use it. In this case, there is no point in worrying about just how good it can be made. The flaw in this reasoning is that merely saying that a data set is as good as it can be made does not tell us how good it is, or whether it is any good at all. What may be considered good data in one case may not be sufficient in another. Data warehousing is now considered the foundation of an enterprise information infrastructure. It is the repository where the data of the enterprise is stored.
It is imperative that the issue of data quality be addressed if the data warehouse is to prove beneficial to an enterprise. Corporations, government agencies (public or private) and not-for-profit groups are all flooded with enormous amounts of data. The desire to use this data as a resource for the enterprise has increased the move towards data warehouses. This information has the potential to give an enterprise a smarter and more efficient understanding of its customers, its processes, and the enterprise itself. Combining this data with data from other sources is a prospective step towards increasing the usefulness of the information and its proper utilization. But if the underlying data is not accurate, any relationships found in the data warehouse will be misleading. For example, most student registration systems require the Registration Number of the student when setting up student information. If no number or an invalid number is available, an invalid output or no output is generated. If these registration numbers are not corrected, some relationship may still exist in the database, but the relationship would be misleading because the underlying data is inaccurate.

The steps for building a data warehouse or repository are well understood. The data flows from one or more source databases into an intermediate staging area, and finally into the data warehouse or repository. At each stage there are data quality tools available to massage, clean and transform the data, thus enhancing the usability of the data once it resides in the data warehouse, where it can easily be mined at a later stage. The proposed research tries to address the various issues regarding the association between data quality tools and the data-enabled processes, so that quality data resides in a data warehouse.
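
The flow from source data through a staging area into the warehouse, with a quality check on the Registration Number, can be illustrated with a minimal sketch. It is not the tooling used in the case study. The registration number format, the field names (`reg_no`, `name`) and the functions `split_by_quality` and `load_warehouse` are all assumptions made for illustration.

```python
import re

# Hypothetical format: the paper does not say what a valid
# registration number looks like, so this pattern is an assumption.
REG_NO_PATTERN = re.compile(r"[A-Z]{2}-\d{4}-\d{5}")


def split_by_quality(source_rows):
    """Route source rows into a staging list and a rejects list."""
    staged, rejected = [], []
    for row in source_rows:
        # Normalise before validating: trim whitespace, upper-case.
        reg_no = (row.get("reg_no") or "").strip().upper()
        if REG_NO_PATTERN.fullmatch(reg_no):
            staged.append({**row, "reg_no": reg_no})
        else:
            # Missing or malformed numbers never reach the warehouse.
            rejected.append(row)
    return staged, rejected


def load_warehouse(staged_rows):
    """Keep one record per registration number (the first one seen)."""
    warehouse = {}
    for row in staged_rows:
        warehouse.setdefault(row["reg_no"], row)
    return warehouse


# Illustrative source records, not real examination data.
source_rows = [
    {"reg_no": "ku-2012-00417", "name": "Student A"},
    {"reg_no": "", "name": "Student B"},
    {"reg_no": "12345", "name": "Student C"},
    {"reg_no": "KU-2012-00417 ", "name": "Student A"},
]

staged, rejected = split_by_quality(source_rows)
warehouse = load_warehouse(staged)
```

Keeping the rejected rows in a separate list, rather than discarding them, leaves them available for correction at the source, which is the step the registration number example above depends on.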
