Machine Learning on Spark for the Optimal IDW-based Spatiotemporal Interpolation

Weitian Tong,Lixin Li,Jason Franklin,Gina Besenyi,Xiaolu Zhou

doi:10.21433/b3114dw721gn

Abstract

GIScience 2016 Short Paper Proceedings

Weitian Tong 1, Jason Franklin 1, Xiaolu Zhou 2, Lixin Li 1*, Gina Besenyi 3

1 Department of Computer Sciences, 2 Department of Geology and Geography, Georgia Southern University, P.O. Box 7997, Statesboro, GA 30460, USA
Emails: {wtong; jf00936; xzhou; lli}@georgiasouthern.edu
3 Clinical and Digital Health Sciences, CAHS, Augusta University, Augusta, GA 30912, USA
Email: gbesenyi@augusta.edu
* Corresponding author

Abstract

To improve current spatiotemporal interpolation methods for public health applications (Li et al., 2010), we combine the extension approach (Li and Revesz, 2004) with machine learning methods, employ the efficient k-d tree structure to store data, and implement our method on Apache Spark (Spark, 2016). Preliminary results demonstrate the computational power of our method: it is faster than the previous work (Li et al., 2014) while producing results of comparable accuracy. Future research will continue exploring this method to improve interpolation accuracy and efficiency, with the long-term objective of establishing associations between air pollution exposure and adverse health effects.

1. Introduction

To implement spatiotemporal interpolation, Li and Revesz (2004) proposed an extension approach, which reduces spatiotemporal interpolation to a higher-dimensional spatial interpolation by treating time as an asymmetric dimension in space. Unfortunately, modern work on spatiotemporal interpolation (Pebesma, 2012; Graler et al., 2013; Losser et al., 2014; Li et al., 2014, etc.) utilizes simplistic methods to scale the range of the time dimension. In recent work, Li et al. (2014) extended the inverse distance weighted (IDW) method (Shepard, 1968) to model the PM2.5 exposure risk by scaling the time domain with a parameter c, a concept similar to the spatiotemporal anisotropy parameter (Graler et al., 2014). Applying the extension approach to the spatial IDW method to interpolate spatiotemporal data, we arrived at the following formulae:

$$w(x, y, ct) = \sum_{i=1}^{n} \lambda_i w_i, \qquad \lambda_i = \frac{(1/d_i)^p}{\sum_{k=1}^{n} (1/d_k)^p}$$

$$d_i = \sqrt{(x_i - x)^2 + (y_i - y)^2 + c^2 (t_i - t)^2}$$

where w(x, y, ct) represents the unknown value to be calculated at the un-sampled location (x, y) and time instance t, c is the spatiotemporal anisotropy parameter, p is the exponent that influences the weighting of the distances d_i, and n is the number of nearest neighbors. Applying k-fold cross validation (k-CV) to the training set identifies the optimal parameters c, p and n for this data set, which are then used to estimate the daily PM2.5 concentration values at unknown points. Building upon this work, our method parallelizes the implementation of the original IDW algorithm using
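The Introduction describes an IDW estimate in which time is scaled by an anisotropy parameter, the nearest neighbors are selected, inverse distances raised to an exponent serve as weights, and k-fold cross validation picks the parameters. The sketch below illustrates those steps in plain Python under several assumptions: the function names (`st_idw`, `k_fold_rmse`, `grid_search`) are hypothetical, a brute-force neighbor search stands in for the k-d tree, RMSE is assumed as the validation error, and everything runs on a single machine rather than on Spark. It is not the authors' implementation.

```python
import heapq
import math
import random


def st_idw(samples, x, y, t, c, p, n):
    """Estimate the value at (x, y, t) from samples given as (x, y, t, value).

    c scales the time axis, p is the distance exponent, n the neighbor count.
    """
    def dist(s):
        # Extension approach: time is treated as a third axis scaled by c.
        return math.sqrt((s[0] - x) ** 2 + (s[1] - y) ** 2 + (c * (s[2] - t)) ** 2)

    # Brute-force search (assumption); a k-d tree would replace this step.
    nearest = heapq.nsmallest(
        n, ((dist(s), s[3]) for s in samples), key=lambda dv: dv[0]
    )
    # A zero distance would give an infinite weight, so return that sample's value.
    for d, v in nearest:
        if d == 0.0:
            return v
    weights = [(1.0 / d) ** p for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)


def k_fold_rmse(samples, k, c, p, n, seed=0):
    """Root mean squared error of st_idw under k-fold cross validation."""
    data = samples[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    sq_err, count = 0.0, 0
    for i, held_out in enumerate(folds):
        # Train on every fold except the held-out one.
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        for sx, sy, st, sv in held_out:
            sq_err += (st_idw(train, sx, sy, st, c, p, n) - sv) ** 2
            count += 1
    return math.sqrt(sq_err / count)


def grid_search(samples, k, cs, ps, ns):
    """Return the (c, p, n) triple with the lowest cross-validated RMSE."""
    candidates = ((c, p, n) for c in cs for p in ps for n in ns)
    return min(candidates, key=lambda cpn: k_fold_rmse(samples, k, *cpn))
```

Each (c, p, n) candidate in `grid_search` is evaluated independently of the others, so the candidates can be scored in parallel. How the authors actually distribute the work on Spark is not shown in this excerpt.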

Full Text