Today, there is a growing need for organizations to continuously analyze and process large waves of incoming data from the Internet. Such data processing is often governed by complex dataflow systems, deployed atop highly scalable infrastructures that must manage data efficiently in order to enhance performance and reduce costs.

Current workflow management systems enforce strict temporal synchronization among the various processing steps; however, this is not the most desirable behavior in many scenarios. For example, for dataflows that continuously analyze data upon the insertion or update of entries in a data store, it would be wise to assess the level of modification in the data before triggering the dataflow, so as to minimize the number of executions (processing steps), reducing overhead and improving performance while keeping the dataflow results within given coverage and freshness limits.

To this end, we introduce the notion of Quality-of-Data (QoD), which describes the level of modification necessary on a data store to trigger processing steps, thereby conveying the level of performance specified through data requirements. This notion can be especially beneficial in cloud computing, where a dataflow computing service (SaaS) may provide different QoD levels for different budgets.

In this article we propose Fluχ, a novel dataflow model, with framework and programming library support, for orchestrating data-based processing steps over a NoSQL data store, where triggering is based on the evaluation and dynamic enforcement of QoD constraints that are defined (and possibly adjusted automatically) for different sets of data. With Fluχ we demonstrate how dataflows can be made to respond to quality boundaries that bring controlled and improved performance, rationalization of resources, and task prioritization.
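To make the triggering semantics concrete, the sketch below gives one plausible reading of a QoD-bound trigger: a processing step fires only once the modifications accumulated on its input data (tracked here as pending update count, elapsed time since the last execution, and fraction of items changed) cross user-defined thresholds. All names and thresholds (QoDConstraint, ModificationTracker, the example bounds) are hypothetical illustrations, not the actual Fluχ programming library.

```python
import time

class QoDConstraint:
    """Hypothetical QoD bound: the step triggers when ANY threshold is crossed.
    Illustrative only; not the Fluχ API."""
    def __init__(self, max_updates=100, max_seconds=60.0, max_changed_fraction=0.10):
        self.max_updates = max_updates                    # pending writes tolerated before firing
        self.max_seconds = max_seconds                    # freshness bound on results
        self.max_changed_fraction = max_changed_fraction  # coverage bound: share of items modified

class ModificationTracker:
    """Accumulates modifications on a data container between step executions."""
    def __init__(self, total_items):
        self.total_items = total_items
        self.pending_updates = 0
        self.changed_keys = set()
        self.last_trigger = time.monotonic()

    def record_update(self, key):
        self.pending_updates += 1
        self.changed_keys.add(key)

    def should_trigger(self, qod: QoDConstraint) -> bool:
        elapsed = time.monotonic() - self.last_trigger
        changed_fraction = len(self.changed_keys) / max(self.total_items, 1)
        return (self.pending_updates >= qod.max_updates
                or elapsed >= qod.max_seconds
                or changed_fraction >= qod.max_changed_fraction)

    def reset(self):
        self.pending_updates = 0
        self.changed_keys.clear()
        self.last_trigger = time.monotonic()

def run_processing_step():
    # Placeholder for the actual dataflow step (e.g., re-running an analysis job).
    print("processing step executed")

# Usage: buffer incoming writes and launch the step only when the QoD bound is violated.
qod = QoDConstraint(max_updates=500, max_seconds=30.0, max_changed_fraction=0.05)
tracker = ModificationTracker(total_items=10_000)

def on_store_write(key):
    tracker.record_update(key)
    if tracker.should_trigger(qod):
        run_processing_step()
        tracker.reset()
```

Under this reading, relaxing the thresholds trades result freshness and coverage for fewer executions, which is the performance/cost lever the QoD notion exposes (and which a SaaS provider could price per QoD level).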