Currently, there is a strong need for organizations to analyze and process ever-increasing volumes of data in order to meet real-time processing demands. Such continuous and data-intensive processing is often achieved through the composition of complex data-intensive workflows (i.e., dataflows).

Dataflow management systems typically enforce strict temporal synchronization across the various processing steps. Non-synchronous behavior often has to be programmed explicitly on an ad-hoc basis, which requires additional lines of code in programs and thus introduces opportunities for error. Moreover, in a large set of scenarios for continuous and incremental processing, the output of a dataflow application at each execution may differ very little from that of the previous execution, and therefore resources, energy, and computational power are unknowingly wasted.

To address this lack of efficiency, transparency, and generality, we introduce the notion of Quality-of-Data (QoD), which describes the level of change required on a data store to trigger the dependent processing steps. Dataflow (re-)execution is thus deferred until its outcome would exhibit a significant and meaningful variation, within a specified freshness limit.

Based on the QoD notion, we propose a novel dataflow model, with an accompanying framework (Fluxy), for orchestrating data-intensive processing steps that communicate data via a NoSQL store and whose triggering semantics is driven by dynamic QoD constraints, automatically defined for different datasets by means of Fuzzy Boolean Nets. These nets give probabilistic guarantees about the prediction of the cumulative error between consecutive dataflow executions. With Fluxy, we demonstrate how dataflows can be leveraged to respect quality boundaries (which can be seen as SLAs) in order to deliver controlled and improved performance, rationalization of resources, and task prioritization.
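To make the triggering semantics concrete, the following is a minimal sketch of how a QoD-style bound on a dataset might gate the (re-)execution of a processing step. It assumes a bound expressed as thresholds on elapsed time, number of pending updates, and accumulated magnitude of change; the names (`QoDBound`, `DatasetMonitor`, `should_trigger`) and the exact threshold fields are illustrative and not taken from the paper or the Fluxy framework.

```python
from dataclasses import dataclass, field
import time

# Hypothetical QoD bound: the downstream step is (re-)triggered only once the
# accumulated changes on its input dataset cross one of these thresholds.
@dataclass
class QoDBound:
    max_staleness_s: float       # maximum time since the step last ran
    max_updates: int             # maximum number of pending writes
    max_value_divergence: float  # maximum cumulative change in the monitored value

@dataclass
class DatasetMonitor:
    bound: QoDBound
    last_run: float = field(default_factory=time.time)
    pending_updates: int = 0
    divergence: float = 0.0

    def record_write(self, delta: float) -> None:
        """Account for one write applied to the input dataset."""
        self.pending_updates += 1
        self.divergence += abs(delta)

    def should_trigger(self) -> bool:
        """Trigger the dependent step once any QoD threshold is exceeded."""
        return (time.time() - self.last_run >= self.bound.max_staleness_s
                or self.pending_updates >= self.bound.max_updates
                or self.divergence >= self.bound.max_value_divergence)

    def reset(self) -> None:
        """Called after the step executes, so accumulated change is cleared."""
        self.last_run = time.time()
        self.pending_updates = 0
        self.divergence = 0.0

# Example: tolerate up to 60 s of staleness, 100 pending writes, or a
# cumulative change of 5.0 before re-running the step.
monitor = DatasetMonitor(QoDBound(max_staleness_s=60, max_updates=100,
                                  max_value_divergence=5.0))
monitor.record_write(0.3)
if monitor.should_trigger():
    # run the processing step, then:
    monitor.reset()
```

In the paper's model these thresholds are not fixed by hand as above but derived dynamically per dataset via Fuzzy Boolean Nets, which estimate the cumulative error between consecutive executions.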