Abstract

Companies face a deluge of data that is generated rapidly, arrives from many different sources, and takes a wide variety of forms. In the area of data provenance, that is, the lineage and veracity of a data set, Big Data exposes the limitations of current technology in providing adequate assurance for auditing purposes. The provenance of the data may be doubtful, its ownership may be in question, and the classification or identification of the information may not be possible until after analysis. Big Data involves accessing and analyzing large volumes of data that may have originated outside the firm, whose origin is unclear and whose custody is not guaranteed; the integrity of the data is also in question. Furthermore, the many data reduction and scrubbing techniques applied to the raw data sets provide scalable processing but do not assure that the data has not been altered. In short, with Big Data the information is not managed and tracked through its entire life cycle to ensure its confidentiality, availability, and integrity. Not only should provenance track the origins and iterations of the data, but the provenance records themselves should be stored securely and be immutable. Without this assurance, provenance records cannot serve as reliable audit evidence and provenance tracking is of little value to auditors. To date, the literature on Big Data provenance has not discussed the security of this provenance information. This conceptual paper examines the technical mechanisms discussed in the literature to provide provenance for Big Data in Hadoop and MapReduce, and proposes a theoretical framework by which these processes can provide secure storage: The Big Data Provenance Black Box.
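To make the requirement of securely stored, immutable provenance records concrete, the following is a minimal illustrative sketch, not the paper's Black Box framework, of a tamper-evident provenance log in Python. The class name ProvenanceLog, its record fields, and the example operations ("ingest", "scrub", "map-reduce") are assumptions introduced here for illustration; the sketch simply hash-chains each record to its predecessor so that any later alteration of a stored record is detectable during verification.

```python
import hashlib
import json
import time


class ProvenanceLog:
    """Append-only, hash-chained log of provenance records (illustrative only).

    Each record stores the SHA-256 hash of the previous record, so modifying
    or reordering any stored record breaks the chain and fails verification.
    """

    GENESIS = "0" * 64  # placeholder hash for the first record's predecessor

    def __init__(self):
        self.records = []

    def append(self, source, operation, detail):
        # Link the new record to the hash of the most recent record.
        prev_hash = self.records[-1]["hash"] if self.records else self.GENESIS
        body = {
            "timestamp": time.time(),
            "source": source,        # hypothetical originating system or file
            "operation": operation,  # e.g. "ingest", "scrub", "map-reduce"
            "detail": detail,
            "prev_hash": prev_hash,
        }
        body["hash"] = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        self.records.append(body)
        return body["hash"]

    def verify(self):
        """Return True only if no stored record has been altered or reordered."""
        prev_hash = self.GENESIS
        for record in self.records:
            body = {k: v for k, v in record.items() if k != "hash"}
            if record["prev_hash"] != prev_hash:
                return False
            expected = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()
            ).hexdigest()
            if record["hash"] != expected:
                return False
            prev_hash = record["hash"]
        return True


if __name__ == "__main__":
    log = ProvenanceLog()
    log.append("external_feed.csv", "ingest", "raw customer records loaded")
    log.append("external_feed.csv", "scrub", "null rows dropped, fields normalized")
    log.append("hdfs://warehouse/customers", "map-reduce", "aggregated by region")
    print("chain intact:", log.verify())   # True

    # Simulate tampering with a stored record: verification now fails.
    log.records[1]["detail"] = "no changes made"
    print("after tampering:", log.verify())  # False
```

A production mechanism would additionally need trusted timestamps, access control, and anchoring of the chain head outside the processing environment; the sketch only shows why hash-chaining makes provenance records usable as audit evidence.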
