Abstract

Saving nature using Big Data Analytics is a very noble goal. Using New York taxi rides data, we decided to learn how many rides could be consolidated. It was a journey we would like to share.First, we had to choose the platform for calculation between Amazon Athena, Serverless Microservices, SQL or NoSql databases, Hadoop and Spark. Then, we had to find an optimal solution for the platform using assorted tuning and optimization techniques.Although the problem seems to be straight forward, it turned out that the solution is quite challenging because of the input size, data quality, calculation complexities and numerous EMR/Spark tuning options. We have been using New York taxi data from 2009 to 2017 to quantify the rides that can be joined together. The taxi rides were consolidated based on pickup location, pickup time and drop-off location. We have been calculating the percentage of taxi rides that can be joined. The benchmark originally set was rides within five minutes with a pickup and drop-off locations within half a kilometer. Then we started experimenting with different times and locations.We have been using parquet format, parallel Scala collections, compression, filtering, new column introduction, tuning parameters, I/O overhead tuning, bucketing, timeouts and partitioning.Over 1.2 billion rides were processed using Amazon EMR with Spark. We have been optimizing calculation time and processing price. Spark has hundreds of parameters, EMR has over fifty instances to choose from. It was challenging to process our data within reasonable time. We were able to find the optimal Spark queries (plans), tested different types of joins and compared their performances. Also, we were able to compare I/O and in-memory operations during partitioning and large files manipulation (the input file sizes were hundreds of Gigabytes).The results were amazing - we could consolidate around thirty five percent of total rides, saving tons of gas and improving environment and traffic in New York City.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call