Abstract

Today, big data analytics and business intelligence (BI) decision support systems (DSS) are vital pillars of organizational leadership, translating raw data into intelligence so that the right decision is made at the right time and shared with the right people. DSS are often challenged to process massive volumes of data (terabytes, petabytes, exabytes, zettabytes, etc.) and to overcome issues of data quality, scalability, storage, and query performance. DSS failure was one of the contributing factors highlighted in the United States Senate report on the 2008 American economic collapse. With these issues in mind, this work explores a preventive methodology for the "Data Quality - Duplicates" dimension with optimized query performance in the big data era. In practice, a BI team periodically (daily, weekly, monthly, quarterly, or half-yearly) extracts and loads historical operational structured data (the data feed) into its repository from multiple sources for analytics and reporting. During these loads, unintentional duplicate data feed insertion occurs due to lack of expertise, missing load history, or absent integrity constraints, which increases the reporting error ratio and undermines leadership decisions. This creates the need to prevent such unintentional data quality defects before they are injected. Overall, this paper proposes a methodology to improve data accuracy by detecting duplicate records between the big data repository and the incoming data feed before the load, with optimized query performance through partition-aware search query generation and faster data block address search through braided B+ tree indexing.
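To illustrate the general idea of a pre-load duplicate check driven by partition-aware query generation, the following is a minimal sketch only, not the paper's implementation. It assumes a partitioned repository table and illustrative names (fact_sales, feed_date, order_id, customer_id) that are not taken from the paper; the braided B+ tree indexing component is not shown.

```python
# Hypothetical sketch of a partition-aware duplicate check before loading a data feed.
# All table and column names below are illustrative assumptions.

from typing import Dict, Iterable, List, Tuple


def build_partition_scan_query(
    table: str,
    partition_column: str,
    partitions: Iterable[str],
    key_columns: List[str],
) -> str:
    """Generate a query that reads only the partitions touched by the incoming
    feed, so the duplicate check benefits from partition pruning instead of a
    full-table scan."""
    partition_list = ", ".join(f"'{p}'" for p in partitions)
    keys = ", ".join(key_columns)
    return (
        f"SELECT DISTINCT {keys} FROM {table} "
        f"WHERE {partition_column} IN ({partition_list})"
    )


def feed_partitions(feed_rows: List[Dict[str, str]], partition_column: str) -> List[str]:
    """Collect the distinct partition values present in the incoming feed."""
    return sorted({row[partition_column] for row in feed_rows})


def find_duplicates(
    feed_rows: List[Dict[str, str]],
    existing_keys: Iterable[Tuple[str, ...]],
    key_columns: List[str],
) -> List[Dict[str, str]]:
    """Return feed rows whose business keys already exist in the repository;
    any hit means this feed (or part of it) was loaded before."""
    existing = set(existing_keys)
    return [r for r in feed_rows if tuple(r[c] for c in key_columns) in existing]


if __name__ == "__main__":
    feed = [
        {"feed_date": "2024-01-01", "order_id": "A1", "customer_id": "C9"},
        {"feed_date": "2024-01-02", "order_id": "A2", "customer_id": "C7"},
    ]
    query = build_partition_scan_query(
        table="fact_sales",
        partition_column="feed_date",
        partitions=feed_partitions(feed, "feed_date"),
        key_columns=["order_id", "customer_id"],
    )
    print(query)
    # Keys returned by running `query` against the repository would be passed
    # to find_duplicates(); a non-empty result blocks the load.
    print(find_duplicates(feed, existing_keys=[("A1", "C9")], key_columns=["order_id", "customer_id"]))
```

In this sketch the query is generated from the feed's own partition values, so only the relevant partitions are scanned; the actual detection of already-loaded records happens before any insert is attempted.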
