Abstract

Detecting quality in large unstructured datasets requires capacities far beyond the limits of human perception and communicability and, as a result, there is an emerging trend towards increasingly complex analytic solutions in data science to cope with this problem. This new trend towards analytic complexity represents a severe challenge for the principle of parsimony (Occam’s razor) in science. This review article combines insights from domains such as physics, computational science, data engineering, and cognitive science to review the specific properties of big data. Problems in detecting data quality without abandoning the principle of parsimony are then highlighted on the basis of specific examples. Computational building-block approaches to data clustering can help to deal with large unstructured datasets in minimized computation time, and meaning can be extracted rapidly and parsimoniously from large sets of unstructured image or video data through relatively simple unsupervised machine learning algorithms. The review then examines why we still largely lack the expertise to exploit big data wisely, whether to extract information relevant to specific tasks, to recognize patterns and generate new information, or simply to store and further process large amounts of sensor data, and presents examples illustrating why subjective views and pragmatic methods are needed to analyze big data contents. The review concludes with a discussion of how cultural differences between East and West are likely to affect the course of big data analytics, and the development of increasingly autonomous artificial intelligence (AI) aimed at coping with the big data deluge in the near future.
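
As a purely illustrative sketch of the kind of parsimonious, unsupervised approach the abstract alludes to (the synthetic data, the flattened-pixel features, the choice of mini-batch k-means, and the cluster count are all assumptions made for this example, not the specific methods reviewed in the article), clustering a large set of unstructured image-like data can be kept very simple:

```python
# Minimal sketch: parsimonious unsupervised clustering of unstructured image-like data.
# Assumptions (not from the article): synthetic 16x16 grayscale "frames", 5 clusters,
# mini-batch k-means as the "relatively simple unsupervised algorithm".
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)

# Stand-in for a large set of unstructured images/video frames: 10,000 frames of 16x16 pixels.
frames = rng.random((10_000, 16, 16), dtype=np.float32)

# Flatten each frame into a feature vector (no hand-crafted features, to keep the model parsimonious).
X = frames.reshape(len(frames), -1)

# Mini-batch k-means keeps computation time low on large datasets.
kmeans = MiniBatchKMeans(n_clusters=5, batch_size=1024, random_state=0)
labels = kmeans.fit_predict(X)

# Cluster sizes give a first, coarse structuring of the raw data.
print(np.bincount(labels))
```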

Highlights

  • The Cisco Global Cloud Index 2016–2021 Forecast [1] estimates that nearly 850 zettabytes (ZB) of data will be generated by all people, machines, and things by 2021, up from the 220 ZB generated in 2016

  • We propose that the most prevailing analytic approaches to big data may be arbitrarily ranked into three categories: (1)

  • It reflects an application domain. It formulates that some data represent the weight of a person, that the unit of measure attached to it is grams, and that values typically lie in the range of 2000 to 200,000. While the former aspect of technical representation is mostly covered by type systems/database schemas, the latter aspect of domain-specific interpretation is buried in constraints and in application code that makes use of the data (see the sketch below)
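
The contrast drawn in this last highlight, between a technical representation covered by type systems or database schemas and a domain-specific interpretation buried in application code, can be made concrete with a minimal sketch; the class name, field name, and bounds below simply mirror the weight-in-grams example and are illustrative assumptions, not code from the article:

```python
# Illustrative sketch only: the technical representation (an integer field) says nothing
# about the domain; the domain-specific interpretation (grams, plausible human weight
# range 2,000-200,000 g) ends up as a hand-written constraint in application code.
from dataclasses import dataclass

GRAMS_MIN, GRAMS_MAX = 2_000, 200_000  # domain knowledge, invisible to the type system

@dataclass(frozen=True)
class PersonWeight:
    grams: int  # technical representation: just an int, as a schema column would be

    def __post_init__(self) -> None:
        # Domain-specific interpretation enforced here, in code, not in the schema.
        if not (GRAMS_MIN <= self.grams <= GRAMS_MAX):
            raise ValueError(f"weight {self.grams} g outside plausible range "
                             f"[{GRAMS_MIN}, {GRAMS_MAX}]")

w = PersonWeight(grams=72_500)   # accepted
# PersonWeight(grams=150)        # would raise ValueError: the range check lives only in code
```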

Introduction

The Cisco Global Cloud Index 2016–2021 Forecast [1] estimates that nearly 850 zettabytes (ZB) of data will be generated by all people, machines, and things by 2021, up from the 220 ZB generated in 2016. The Cisco forecast states further that “most of this ephemeral data is deemed not useful to save”, and that “approximately 10 percent of it is useful, which means that there will be 10 times more useful data being created (85 ZB, 10 percent of the 850 total) than will be stored or used (7.2 ZB) in 2021”. Such volumes of data have far-reaching consequences for science, business, and society. The big data issue, coupled with that of finding new data analytics, radically challenges established theory and practice across the sciences, engendering a new form of scientific uncertainty and paradigm shifts [2] in all major fields of science, from physics to the humanities. Already more than 10 years ago, Anderson [3], among other visionaries, predicted that the data deluge would make the scientific method obsolete.
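
To make the quoted Cisco proportions concrete, here is a simple arithmetic check using only the numbers already cited above:

```python
# Arithmetic check of the quoted Cisco 2021 figures (all quantities in zettabytes, ZB).
generated_total = 850.0   # data generated by people, machines, and things
useful_fraction = 0.10    # "approximately 10 percent of it is useful"
stored_or_used = 7.2      # data actually stored or used

useful = generated_total * useful_fraction
print(f"useful data created: {useful:.0f} ZB")                    # 85 ZB
print(f"useful vs. stored ratio: {useful / stored_or_used:.1f}")  # ~11.8, i.e. roughly 10x
```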

