Abstract

We present a dataflow model for parallel Unix shell pipelines. To accurately capture the semantics of complex Unix pipelines, the dataflow model is order-aware, i.e., the order in which a node in the dataflow graph consumes inputs from different edges plays a central role in the semantics of the computation and therefore in the resulting parallelization. We use this model to capture the semantics of transformations that exploit data parallelism available in Unix shell computations and prove their correctness. We additionally formalize the translations from the Unix shell to the dataflow model and from the dataflow model back to a parallel shell script. We implement our model and transformations as the compiler and optimization passes of a system parallelizing shell pipelines, and use it to evaluate the speedup achieved on 47 pipelines.

Highlights

  • Unix pipelines are an attractive choice for specifying succinct and simple programs for data processing, system orchestration, and other automation tasks [McIlroy et al 1978]

  • In contrast to standard dataflow models [Kahn 1974; Kahn and MacQueen 1977; Karp and Miller 1966; Lee and Messerschmitt 1987a,b], our dataflow model is order-aware, i.e., the order in which a node in the dataflow graph consumes inputs from different edges plays a central role in the semantics of the computation and in the resulting parallelization

  • We presented an order-aware dataflow model for exploiting data parallelism latent in Unix shell scripts
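As a minimal illustration of what order-awareness means (the command and filenames below are our own, not drawn from the paper): `cat` consumes its input edges strictly in sequence, first exhausting one input and only then reading the next, so any parallelizing transformation must reproduce that consumption order in the combined output.

```shell
# cat reads its inputs one after another, so the output ordering is
# fixed by the argument order, never by arrival time of the data.
printf 'a\nb\n' > first.txt
printf 'c\nd\n' > second.txt
cat first.txt second.txt   # always a, b, c, d -- never interleaved
```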

Introduction

Unix pipelines are an attractive choice for specifying succinct and simple programs for data processing, system orchestration, and other automation tasks [McIlroy et al. 1978]. The first command streams two markdown files into a pipeline that converts the characters in the stream to lower case, removes punctuation, sorts the stream in alphabetical order, removes duplicate words, and filters out words that appear in a dictionary file (lines 1 and 2, up to ';').

Unix streams: A key Unix abstraction is the data stream, operated upon by executing commands or processes. Streams are sequences of bytes, but most commands process them as higher-level sequences of line elements, with the newline character delimiting each element and the EOF condition marking the end of a stream. Sequence order is preserved when converting between persistent files and ephemeral streams.
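The pipeline described above can be sketched as follows. The filenames and sample inputs are illustrative assumptions on our part, and `comm -23` (which requires both of its inputs to be sorted) stands in for the dictionary-filtering step:

```shell
# Illustrative inputs (assumed, not from the paper).
printf 'Hello world\nfoo Bar!\n' > ch1.md
printf 'baz hello\n' > ch2.md
printf 'bar\nbaz\n' > dict.txt   # dictionary file, already sorted

cat ch1.md ch2.md |
  tr A-Z a-z |             # convert characters to lower case
  tr -d '[:punct:]' |      # remove punctuation
  tr ' ' '\n' |            # one word per line
  sort |                   # sort the stream alphabetically
  uniq |                   # remove duplicate words
  comm -23 - dict.txt      # keep only words NOT in the dictionary
```

Each stage is a candidate for the data-parallel transformations the paper studies: `tr` is stateless and trivially splittable, while `sort`, `uniq`, and `comm` depend on input order and need order-aware treatment.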

