Abstract

AWS Glue is Amazon's serverless data integration cloud service that makes it simple and cost-effective to extract, clean, enrich, load, and organize data. Originally launched in August 2017, AWS Glue began as an extract-transform-load (ETL) service designed to relieve developers and data engineers of the undifferentiated heavy lifting needed to load databases and data warehouses and to build data lakes on Amazon S3. Since then, it has evolved to serve a larger audience, including ETL specialists and data scientists, and now includes a broader suite of data integration capabilities. Today, hundreds of thousands of customers use AWS Glue every month. In this paper, we describe the use cases and challenges cloud customers face in preparing data for analytics, and the tenets we chose to drive Glue's design. We chose early on to focus on ease of use, scale, and extensibility. At its core, Glue offers serverless Apache Spark and Python engines backed by a purpose-built resource manager for fast startup and auto-scaling. In Spark, it offers a new data structure, DynamicFrames, for manipulating messy, schema-free, semi-structured data such as event logs; a variety of transformations and tooling to simplify data preparation; and a new shuffle plugin to offload shuffle data to cloud storage. It also includes a Hive Metastore-compatible Data Catalog with Glue crawlers to build and manage metadata, e.g., for data lakes on Amazon S3. Finally, Glue Studio is its visual interface for authoring Spark and Python-based ETL jobs. We describe the innovations that differentiate AWS Glue and drive its popularity, and how it has evolved over the years.
