Abstract

Very large data sets - telephone call records, network logs, high-resolution satellite images, or web document repositories - are not easily analyzed using traditional database techniques. They may simply be too large, grow too fast, or not fit well in a database schema. They tend to span multiple disks and machines. On the other hand, these large data sets often have a flat and regular structure that permits distributed filtering and aggregation. We present a system and language for such analyses*. A filtering phase, in which a query is expressed using the procedural programming language Sawzall, emits data to an aggregation phase. Both phases are distributed over hundreds or even thousands of computers. The language constructs and execution model of Sawzall have been devised to enable parallel execution without the need for complex dependency analysis. Even with our fairly traditional implementation of the Sawzall execution engine, we observe nearly perfect scalability as we add more machines.

*Joint work with Sean Dorward, Rob Pike, and Sean Quinlan.
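The two-phase model described in the abstract can be illustrated with a minimal sketch. It is written in Python, not Sawzall, and it is not the system's implementation. It assumes that each record is processed in isolation and that aggregation is a simple per-table sum. The names `filter_record`, `run_filter`, `aggregate`, and `analyze`, the `bytes` field, and the use of a thread pool in place of a cluster are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def filter_record(record, emit):
    """Filtering phase (hypothetical): look at one record in isolation and emit values."""
    emit("count", 1)
    emit("total_bytes", record["bytes"])


def run_filter(shard):
    """Run the filter over one shard of records and collect everything it emits."""
    emitted = []

    def emit(table, value):
        emitted.append((table, value))

    for record in shard:
        filter_record(record, emit)
    return emitted


def aggregate(emitted_streams):
    """Aggregation phase (assumed here to be a sum): combine emitted values per table."""
    tables = defaultdict(int)
    for stream in emitted_streams:
        for table, value in stream:
            tables[table] += value
    return dict(tables)


def analyze(shards):
    """Filter the shards concurrently, then aggregate; threads stand in for machines."""
    with ThreadPoolExecutor() as pool:
        return aggregate(pool.map(run_filter, shards))


if __name__ == "__main__":
    shards = [[{"bytes": 10}, {"bytes": 20}], [{"bytes": 5}]]
    print(analyze(shards))
```

Because the filter in this sketch sees one record at a time and shares no state between records, the shards can be processed in any order or in parallel without dependency analysis, and a sum gives the same result however the records are partitioned.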
