Abstract

We consider parallel computation for Gaussian process calculations to overcome the computational and memory constraints that limit the size of datasets that can be analyzed. Using a hybrid parallelization approach that combines threading (shared memory) and message-passing (distributed memory), we implement the core linear algebra operations used in spatial statistics and Gaussian process regression in an R package called bigGP that relies on C and MPI. The approach divides the matrix into blocks such that the computational load is balanced across processes while communication between processes is limited. The package provides an API that lets R programmers implement Gaussian process-based methods using the distributed linear algebra operations without any C or MPI coding. We illustrate the approach and software by analyzing an astrophysics dataset with n=67,275 observations.

Highlights

  • Gaussian processes are widely used in statistics and machine learning for spatial and spatiotemporal modeling [Banerjee et al., 2003], design and analysis of computer experiments [Kennedy and O’Hagan, 2001], and non-parametric regression [Rasmussen and Williams, 2006]

  • One popular example is the spatial statistics method of kriging, which is equivalent to conditional expectation under a Gaussian process model for the unknown spatial field

  • Because of the computational and memory demands of Gaussian process calculations, standard spatial statistics methods are typically applied to datasets with at most a few thousand observations
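The equivalence between kriging and conditional expectation noted in the highlights can be made concrete with a small example. The following is a minimal sketch in plain Python, not the bigGP API: it assumes a known constant mean (simple kriging), one-dimensional locations, and an exponential covariance function, and the names `exp_cov`, `cholesky`, and `krige` are hypothetical helpers introduced only for illustration. The prediction is the conditional mean, mean + c0' C^{-1} (y - mean), computed through a Cholesky factorization of the covariance matrix C, which is the dense operation whose cost limits dataset size.

```python
import math


def exp_cov(x1, x2, variance=1.0, length_scale=1.0):
    """Exponential covariance between two 1-D locations (an assumed model)."""
    return variance * math.exp(-abs(x1 - x2) / length_scale)


def cholesky(a):
    """Return lower-triangular L with L L^T = a, for a symmetric positive definite a."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        diag = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(diag)
        for i in range(j + 1, n):
            off = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = off / L[j][j]
    return L


def forward_solve(L, b):
    """Solve L y = b for lower-triangular L."""
    y = [0.0] * len(b)
    for i in range(len(b)):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    return y


def backward_solve(L, y):
    """Solve L^T x = y for lower-triangular L."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def krige(x_obs, y_obs, x_new, mean=0.0, nugget=1e-10):
    """Simple kriging: the conditional mean of the field at x_new given y_obs."""
    n = len(x_obs)
    # Covariance matrix of the observations, with a small nugget on the diagonal.
    C = [[exp_cov(x_obs[i], x_obs[j]) + (nugget if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(C)
    resid = [y - mean for y in y_obs]
    # w = C^{-1} (y - mean), obtained from two triangular solves.
    w = backward_solve(L, forward_solve(L, resid))
    return [mean + sum(exp_cov(xs, x_obs[i]) * w[i] for i in range(n))
            for xs in x_new]


if __name__ == "__main__":
    x_obs = [0.0, 1.0, 2.5]
    y_obs = [1.0, -0.5, 2.0]
    print(krige(x_obs, y_obs, [0.5, 1.75]))
```

With a negligible nugget the predictor interpolates the observations, and far from the data it reverts to the assumed mean. The factorization step scales cubically in the number of observations.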


Summary

Introduction

Gaussian processes are widely used in statistics and machine learning for spatial and spatiotemporal modeling [Banerjee et al., 2003], design and analysis of computer experiments [Kennedy and O’Hagan, 2001], and non-parametric regression [Rasmussen and Williams, 2006]. Because of the computational and memory demands of Gaussian process calculations, standard spatial statistics methods are typically applied to datasets with at most a few thousand observations. To overcome these limitations, a small industry has arisen to develop computationally efficient approaches to spatial statistics, involving reduced rank approximations [Kammann and Wand, 2003, Banerjee et al., 2008, Cressie and Johannesson, 2008], tapering the covariance matrix to induce sparsity [Furrer et al., 2006, Kaufman et al., 2008], approximation of the likelihood [Stein et al., 2004], and fitting local models by stratifying the spatial domain [Gramacy and Lee, 2008], among others. We present an algorithm and R package, bigGP, for distributed linear algebra calculations focused on those used in spatial statistics and closely related Gaussian process regression methods. We illustrate the use of the software for Gaussian process regression in an astrophysics application.
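To give a sense of scale for the computational and memory limitations, the short calculation below estimates the cost of a dense covariance matrix for the n=67,275 astrophysics dataset mentioned in the abstract. It is a back-of-the-envelope sketch under stated assumptions: 8-byte double-precision storage of the full n-by-n matrix, and the standard leading-order count of n^3/3 floating-point operations for a Cholesky factorization. The function name `dense_cov_cost` is hypothetical and is not part of bigGP.

```python
def dense_cov_cost(n, bytes_per_value=8):
    """Return (memory in bytes, approximate Cholesky flops) for a dense n x n matrix."""
    memory_bytes = n * n * bytes_per_value  # full matrix, no symmetry savings
    cholesky_flops = n ** 3 / 3             # leading-order operation count
    return memory_bytes, cholesky_flops


if __name__ == "__main__":
    mem, flops = dense_cov_cost(67_275)
    print(f"memory: {mem / 1e9:.1f} GB, Cholesky: {flops:.2e} flops")
```

Under these assumptions a single copy of the matrix needs roughly 36 GB, which exceeds the memory of a typical single machine and helps explain why distributing the blocks of the matrix across processes is attractive.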

Distributed linear algebra calculations
Our algorithm
Memory use
Advantages of our approach
Overview
Kriging implementation
Using the API
Timing results
Choice of h and comparison to ScaLAPACK
Timing with increasing problem size
Effect of number of cores per process
Using GPUs to speed up the linear algebra
Background
Statistical model
R code
Results
Discussion
