Abstract

Big Data is an essential research area for governments, institutions, and private agencies seeking to support their analytics decisions. Big Data is concerned with every aspect of data: how it is collected, processed, and analyzed to generate value-added, data-driven insights and decisions. Degradation in data quality may result in unpredictable consequences; in such cases, confidence in the data and its sources, and hence their trustworthiness, is lost. In the Big Data context, data characteristics such as large volume, multiple heterogeneous data sources, and fast data generation increase the risk of quality degradation and require efficient mechanisms to check data trustworthiness. However, ensuring Big Data Quality (BDQ) is a costly and time-consuming process, since it demands substantial computing resources. Maintaining quality throughout the Big Data lifecycle requires quality profiling and verification before any processing decision is made. This paper proposes a BDQ Management Framework that enhances pre-processing activities while strengthening data control. The framework is built around a new concept, the Data Quality Profile (DQP), which captures the quality outline, requirements, attributes, dimensions, scores, and rules. Using the framework's Big Data profiling and sampling components, a fast and efficient data quality estimation is performed before and after an intermediate pre-processing phase. The framework's exploratory profiling component plays the initial role in quality profiling; it uses a set of predefined quality metrics to evaluate important data quality dimensions, and it generates quality rules associated with pre-processing activities and their related functions. These rules feed the Data Quality Profile and yield quality scores for the selected quality attributes. The paper discusses the framework implementation and dataflow management across the various quality management processes, and concludes with ongoing work on framework evaluation and deployment to support quality evaluation decisions.
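
The Data Quality Profile is described only conceptually in the abstract. The following minimal Python sketch illustrates one way such a profile could be represented; the class and field names (QualityRule, DataQualityProfile, failing_pairs) are hypothetical illustrations, not the paper's own definitions.

```python
# Illustrative sketch of a Data Quality Profile (DQP) record. All names
# here are hypothetical; the paper defines the profile conceptually as
# capturing the quality outline, requirements, attributes, dimensions,
# scores, and rules.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class QualityRule:
    """A pre-processing rule targeting one attribute and one quality dimension."""
    attribute: str   # e.g. "customer_email"
    dimension: str   # e.g. "completeness", "accuracy"
    activity: str    # pre-processing activity, e.g. "cleansing"
    function: str    # activity function, e.g. "drop_null_rows"


@dataclass
class DataQualityProfile:
    """Container for the quality information gathered during profiling."""
    dataset_id: str
    attributes: List[str]                          # profiled attributes
    requirements: Dict[str, float]                 # dimension -> required score
    scores: Dict[str, Dict[str, float]] = field(default_factory=dict)
    # scores[attribute][dimension] -> measured quality score in [0, 1]
    rules: List[QualityRule] = field(default_factory=list)

    def failing_pairs(self) -> List[Tuple[str, str]]:
        """Return (attribute, dimension) pairs whose measured score falls
        below the required score, i.e. candidates for pre-processing."""
        return [
            (attr, dim)
            for attr, dims in self.scores.items()
            for dim, score in dims.items()
            if score < self.requirements.get(dim, 0.0)
        ]
```

Under this reading, the profile is both the output of profiling (scores) and the input to pre-processing (rules), which matches the framework's before-and-after quality estimation described above.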

Highlights

  • Big Data is universal [1]; it consists of large volumes of data of unconventional types

  • Data Quality Profile (DQP) and repository (DQPREPO): we describe the content of the Data Quality Profile, the DQP repository, and the DQP levels captured throughout the lifecycle of the framework processes

  • Quality selection: the selection of an appropriate quality metric to evaluate a data quality dimension for an attribute of a Big Data sample set; it returns the count of correct values, i.e. those that comply with the metric formula (a minimal sketch follows this list)
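
To make the quality-selection step concrete, here is a minimal sketch, assuming a hypothetical metric registry (METRICS) and helper (evaluate_dimension) that are not from the paper: a metric is selected for a dimension, applied to one attribute of a sample, and the count of complying values is turned into a score as correct / total.

```python
# Hedged sketch of "quality selection": pick a metric for a quality
# dimension, apply it to one attribute of a sample, and return the count
# of values that comply with the metric formula. The registry and the
# formulas below are illustrative, not the paper's own.
import re
from typing import Any, Callable, Dict, List, Tuple

# dimension -> predicate deciding whether a single value is "correct"
METRICS: Dict[str, Callable[[Any], bool]] = {
    "completeness": lambda v: v not in (None, ""),
    "validity_email": lambda v: isinstance(v, str)
    and re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
}


def evaluate_dimension(sample: List[Any], dimension: str) -> Tuple[int, float]:
    """Count the sample values of one attribute that satisfy the selected
    metric, and derive the dimension score as correct / total."""
    metric = METRICS[dimension]                   # the quality-selection step
    correct = sum(1 for v in sample if metric(v))
    score = correct / len(sample) if sample else 0.0
    return correct, score


# Example: completeness of a sampled attribute column
emails = ["a@x.com", None, "b@y.org", "", "c@z.net"]
count, score = evaluate_dimension(emails, "completeness")
print(count, round(score, 2))  # 3 0.6
```

In this reading, a dimension score is simply the fraction of sampled values that satisfy the selected metric's formula.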



Introduction

Big Data is universal [1]; it consists of large volumes of data of unconventional types. Profiling such data over samples helps the user obtain an overview of selected Data Quality Dimensions (DQDs) and make a better attribute selection based on this first quality approximation, together with a ready-to-use list of rules for pre-processing. The resulting profile information includes, for example, quality requirements, the DQES, DQD scores, data quality rules, pre-processing activities, activity functions, DQD metrics, and data profiles.
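
The "first quality approximation" over a sample can be pictured with the hedged sketch below: each attribute of a small sample is scored on two assumed DQDs, and a ready-to-use pre-processing rule is emitted whenever a score falls below a requirement. The thresholds, dimension names, and rule actions (REQUIREMENTS, profile_sample) are illustrative assumptions, not the paper's own definitions.

```python
# Illustrative sample-based quality approximation: score each attribute
# on assumed DQDs, then emit a pre-processing rule for every score that
# misses an assumed requirement threshold.
from typing import Any, Dict, List

REQUIREMENTS = {"completeness": 0.95, "uniqueness": 0.90}  # assumed targets


def profile_sample(rows: List[Dict[str, Any]]) -> List[dict]:
    rules = []
    attributes = rows[0].keys() if rows else []
    for attr in attributes:
        values = [r.get(attr) for r in rows]
        present = [v for v in values if v not in (None, "")]
        scores = {
            "completeness": len(present) / len(values),
            "uniqueness": len(set(present)) / len(present) if present else 0.0,
        }
        for dim, score in scores.items():
            if score < REQUIREMENTS[dim]:
                rules.append({"attribute": attr, "dimension": dim,
                              "score": round(score, 2),
                              "activity": "cleansing" if dim == "completeness"
                              else "deduplication"})
    return rules


sample = [{"id": 1, "email": "a@x.com"}, {"id": 2, "email": None},
          {"id": 3, "email": "a@x.com"}]
for rule in profile_sample(sample):
    print(rule)  # two rules for "email": low completeness, low uniqueness
```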

