Abstract

Data scientists and researchers use enormous volumes of spatio-temporal (ST) data to build machine learning models that solve practical problems in diverse domains, including intelligent transportation, urban planning, and epidemic prediction. Extracting application-specific features from big spatio-temporal data imposes three system requirements: support for heterogeneous data, efficient and scalable computation over the spatial and temporal dimensions, and a user-friendly programming interface. This paper presents ST4ML, a distributed spatio-temporal data processing system that supports scalable machine-learning-oriented applications. We propose a three-stage pipelined computing framework, "selection-conversion-extraction", to abstract the distributed computing flow, and we implement it on Apache Spark. To the best of our knowledge, ST4ML is the first system of its kind to realize these design considerations. Extensive experiments on real-world datasets show that ST4ML outperforms straightforward extensions of existing ST data processing systems by up to an order of magnitude. ST4ML is open-sourced at https://github.com/Panrong/st4ml.

