Abstract

Data profiling and data quality management have become an increasingly significant part of data engineering and are essential to ensuring that a system delivers quality information to its users. Over the last decade, data quality has been recognized as needing more active management, especially in the big data era, in which data arrive from many sources, in many types, and in enormous volumes. This makes managing data quality more difficult and complicated, and traditional systems have been unable to respond as needed. Data quality management software for big data has been developed, but it is often expensive, difficult to customize, and mostly provided as a GUI, which is challenging to integrate with other systems. To address this problem, we have developed an open-source package for data quality management in Python, a programming language that is widely used in science and engineering today because its syntax is easy to read, it is small, and it has many additional packages to integrate with. The software, called “Sakdas”, is divided into three parts. The first part deals with data profiling: it provides a set of data analyses that generate a data profile, which in turn helps to define data quality rules. The second part deals with data quality auditing, in which users set their own data quality rules for data quality measurement. The final part deals with data visualization, providing data profiling and data auditing reports that help to improve data quality. The user can obtain the results of the profiling and auditing services either as a report for self-review or as JSON for use in post-process automation.
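
The profile, audit, and JSON flow described above can be illustrated with a minimal sketch. It uses only the Python standard library and is not the Sakdas API: the function names `profile` and `audit`, the rule format, and the sample records are all hypothetical, chosen only to show how a profile and user-defined rules might be combined into a machine-readable result.

```python
import json


def profile(rows):
    """Build a simple per-column profile from a list of dict records (hypothetical helper)."""
    columns = {key for row in rows for key in row}
    result = {}
    for col in sorted(columns):
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        result[col] = {
            "count": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "types": sorted({type(v).__name__ for v in present}),
        }
    return result


def audit(rows, rules):
    """Apply user-defined rules; each rule maps a name to (column, predicate)."""
    report = {}
    for name, (col, predicate) in rules.items():
        passed = sum(1 for row in rows if predicate(row.get(col)))
        report[name] = {"column": col, "passed": passed, "failed": len(rows) - passed}
    return report


# Made-up sample records, for illustration only.
rows = [
    {"id": 1, "age": 34, "email": "a@example.com"},
    {"id": 2, "age": None, "email": "b@example.com"},
    {"id": 3, "age": -5, "email": None},
]

# Rules a user might write after inspecting the profile (assumed rule format).
rules = {
    "age_not_null": ("age", lambda v: v is not None),
    "age_non_negative": ("age", lambda v: v is not None and v >= 0),
}

profile_result = profile(rows)
audit_result = audit(rows, rules)

# JSON output suitable for post-process automation.
report_json = json.dumps({"profile": profile_result, "audit": audit_result}, indent=2)
print(report_json)
```

In this sketch the profile exposes the missing value in `age`, which is the kind of observation that would prompt a rule such as `age_not_null`; the JSON string could then be passed to a downstream job instead of being read by a person.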
