Abstract

This work presents a serverless architecture for an event-driven Extract, Transform, and Load (ETL) pipeline and evaluates its performance over a range of dataflow tasks of varying frequency, velocity, and payload size. We design an experiment using generated tabular data across varying data volumes, event frequencies, and processing power in order to measure: (i) the consistency of pipeline executions; (ii) the reliability of data delivery; (iii) the maximum payload size per pipeline; and (iv) economic scalability (cost of chargeable tasks). We run 92 parameterised experiments on a simple AWS architecture, avoiding any AWS-enhanced platform features, in order to allow for an unbiased assessment of our model’s performance. Our results indicate that our reference architecture can achieve time-consistent data processing of event payloads of more than 100 MB, with a throughput of 750 KB/s across four event frequencies. We also observe that, although using an SQS queue for data transfer enables easy concurrency control and data slicing, the queue becomes a bottleneck for large event payloads. Finally, we develop and discuss a candidate pricing model for usage of our reference architecture.
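To make the data-slicing idea and the reported figures more concrete, the following is a minimal sketch, not the authors' implementation. It groups tabular rows into batches that fit under a per-message size cap, and it estimates how long a payload would take at a given throughput. The names `slice_rows`, `transfer_seconds`, and `MAX_MESSAGE_BYTES` are hypothetical, the cap value is an assumed parameter rather than a figure from this work, and decimal units (1 MB = 1,000,000 bytes) are assumed.

```python
import json
from typing import Iterable, Iterator

# Assumed per-message size cap in bytes (hypothetical parameter; check the
# queue service's current limit before relying on this value).
MAX_MESSAGE_BYTES = 256 * 1024


def slice_rows(
    rows: Iterable[dict], max_bytes: int = MAX_MESSAGE_BYTES
) -> Iterator[list[dict]]:
    """Group rows into batches whose JSON encoding stays within max_bytes."""
    batch: list[dict] = []
    size = 2  # bytes for the enclosing "[" and "]"
    for row in rows:
        encoded = len(json.dumps(row).encode("utf-8"))
        if encoded + 2 > max_bytes:
            raise ValueError("a single row exceeds the message size limit")
        # json.dumps separates list items with ", " (2 bytes) by default.
        extra = encoded + (2 if batch else 0)
        if size + extra > max_bytes:
            # Current batch is full: emit it and start a new one.
            yield batch
            batch, size = [], 2
            extra = encoded
        batch.append(row)
        size += extra
    if batch:
        yield batch


def transfer_seconds(payload_bytes: int, throughput_bytes_per_s: float) -> float:
    """Estimate the time needed to move a payload at a constant throughput."""
    return payload_bytes / throughput_bytes_per_s


if __name__ == "__main__":
    rows = [{"id": i, "value": "x" * 100} for i in range(1000)]
    batches = list(slice_rows(rows, max_bytes=4096))
    print(len(batches), "batches")
    # 100 MB at 750 KB/s, under the decimal-unit assumption stated above.
    print(round(transfer_seconds(100_000_000, 750_000), 1), "seconds")
```

Under these assumptions, and if the reported 750 KB/s applied to a single 100 MB payload, processing that payload would take a little over two minutes. A smaller message cap produces more slices per event, which is one plausible way a queue could become the limiting factor for large payloads.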

Highlights

  • Efficient, scalable, and cost-effective data processing and pipelining have become critically important in real-time analytics for decision making

  • We extend the state of the art in this field of research by adding our proposed architecture to the set of practical serverless systems for general

  • We develop and present a reference architecture for building an event-driven ETL pipeline on AWS using entirely serverless technologies, enabling a pay-per-usage model


Summary

Introduction

Scalable and cost-effective data processing and pipelining have become critically important in real-time analytics for decision making. The paradigm favours real-time, event-driven solutions over periodic batch processing. Systematic techniques for these tasks have been thoroughly investigated, and many open source tools and frameworks have emerged and been adopted by industry [1,2,3,4]. Event-driven ETLs offer an alternative approach: they remove the need for fixed-interval runs and operate reactively, allowing changes in the data source to trigger data processing. This approach features real-time feedback, efficient utilization of resources, and elasticity [5], and it is often more desirable with respect to business requirements. Serverless computing is a relatively recent and increasingly popular evolution of cloud computing technology. It aims to provide a new programming model that fully abstracts away the infrastructure layer for developers [11]. The scope of this work is not to formally compare this system with the traditional

