Abstract

The creation of new knowledge from manipulating and analysing existing knowledge is one of the primary objectives of any cognitive system. Most of the effort in Big Data research has been focussed on Volume and Velocity, while Variety, “the ugly duckling” of Big Data, is often neglected and the most difficult to address. A principal challenge with Variety is understanding the data. This paper proposes and evaluates an automated approach for metadata identification and enrichment in describing Big Data. The paper focuses on the use of self-learning systems that enable automated compliance checking of data against regulatory requirements, along with the capability of generating valuable and readily usable metadata for data classification. Two experiments, on data confidentiality and data identification, were conducted to evaluate the feasibility of the approach. The experiments set out to confirm that repetitive manual tasks can be automated, reducing the time a Data Scientist spends on data identification and freeing more effort for the extraction and analysis of the data itself. The datasets used originated from Private/Business and Public/Governmental sources and exhibited diverse characteristics in the number and size of their files. The experimental work confirmed that: (a) the use of algorithmic techniques contributed to a substantial decrease in false positives in the identification of confidential information; and (b) a fraction of a dataset, combined with statistical analysis and supervised learning, is sufficient to identify the structure of the information within it (illustrated in the sketch that follows). With this approach, the challenge of understanding the nature of the data can be mitigated, enabling a greater focus on meaningful interpretation of the heterogeneous data.
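
The abstract does not disclose the sampling ratio or the learning machinery behind finding (b); the following Python sketch is only a minimal illustration of the idea, assuming hand-picked structural features, a tiny hypothetical training set, and scikit-learn's DecisionTreeClassifier as the supervised learner. None of these specific choices come from the paper itself.

    import random
    from sklearn.tree import DecisionTreeClassifier

    def features(value: str):
        # Crude structural features of a single field value (illustrative).
        return [
            len(value),
            sum(c.isdigit() for c in value),
            sum(c.isalpha() for c in value),
            value.count("-") + value.count("/"),
        ]

    # Hypothetical labelled examples: (field value, structural class).
    training = [
        ("2021-03-14", "date"), ("1999/12/31", "date"),
        ("42", "number"), ("10500", "number"),
        ("Alice", "text"), ("Liverpool", "text"),
    ]
    model = DecisionTreeClassifier().fit(
        [features(v) for v, _ in training],
        [label for _, label in training],
    )

    def classify_column(values, fraction=0.1):
        # Inspect only a small random fraction of the column, then take a
        # majority vote over the predicted structural classes.
        sample = random.sample(values, max(1, int(len(values) * fraction)))
        guesses = list(model.predict([features(v) for v in sample]))
        return max(set(guesses), key=guesses.count)

    column = ["2020-01-01", "2020-02-29", "2021-07-04"] * 50
    print(classify_column(column))  # expected: "date"

Even at a 10% sample the majority vote stabilises quickly, because structural classes tend to be highly repetitive within a column; this repetitiveness is the intuition behind sampling only a fraction of the data.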

Highlights

  • Laney, often referred to as the father of Big Data, introduced the three dimensions that characterise Big Data, which have become the industry standard for defining it [1, 2]

  • To calculate performance against simple Regular Expressions (RegEx), the 50% confidence level was taken as the benchmark

  • Standard RegEx would always yield confidence values lower than 50%, whilst the “boosters” would elevate the respective scores to higher percentages (see the sketch after this list)
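
The paper's scoring code is not reproduced here; the sketch below is a minimal Python illustration of the benchmark described in these highlights, assuming a base confidence of 0.40 for a bare RegEx match and a hypothetical table of “booster” context keywords that lift the score past the 50% mark. The pattern, base value, and keyword weights are illustrative assumptions only.

    import re

    # Base confidence of a bare pattern match: deliberately below the 50%
    # benchmark, since standard RegEx alone never clears it.
    BASE_CONFIDENCE = 0.40

    # Hypothetical "booster" keywords: nearby context raises confidence.
    BOOSTERS = {"confidential": 0.15, "ssn": 0.20, "account": 0.10}

    # An SSN-like shape, used here purely as an example pattern.
    PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

    def score(text: str) -> float:
        # No pattern match at all means no confidence.
        if not PATTERN.search(text):
            return 0.0
        confidence = BASE_CONFIDENCE
        lowered = text.lower()
        for keyword, weight in BOOSTERS.items():
            if keyword in lowered:
                confidence += weight
        return min(confidence, 1.0)

    print(score("SSN on file: 123-45-6789"))  # 0.60 -- boosted past 50%
    print(score("ref 123-45-6789"))           # 0.40 -- bare match stays below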



Introduction

Laney, often referred to as the father of Big Data, introduced three dimensions that characterise Big Data, which have become the industry standard for defining Big Data [1, 2]. Business is concerned with income and the realization of competitive advantages; as a result, Value is another dimension that poses a significant challenge. These business dimensions have become essential as Big Data has gained growing acceptance in the data-driven decision (DDD) making approach [5]. In the literature review, the second and third screenings extended to the abstract, introduction and conclusions, and to a complete document review, respectively.
