Machine learning calibration of low-cost NO2 and PM10 sensors: non-linear algorithms and their impact on site transferability

Peer Nowack, Hannah Gardiner, John Cant, Lev Konstantinovskiy

doi:10.5194/amt-14-5637-2021

Open Access

Abstract

Low-cost air pollution sensors often fail to attain sufficient performance compared with state-of-the-art measurement stations, and they typically require expensive laboratory-based calibration procedures. A repeatedly proposed strategy to overcome these limitations is calibration through co-location with public measurement stations. Here we test the idea of using machine learning algorithms for such calibration tasks using hourly-averaged co-location data for nitrogen dioxide (NO2) and particulate matter of particle sizes smaller than 10 µm (PM10) at three different locations in the urban area of London, UK. We compare the performance of ridge regression, a linear statistical learning algorithm, to two non-linear algorithms in the form of random forest regression (RFR) and Gaussian process regression (GPR). We further benchmark the performance of all three machine learning methods relative to the more common multiple linear regression (MLR). We obtain very good out-of-sample R2 scores (coefficient of determination) >0.7, frequently exceeding 0.8, for the machine learning calibrated low-cost sensors. In contrast, the performance of MLR is more dependent on random variations in the sensor hardware and co-located signals, and it is also more sensitive to the length of the co-location period. We find that, subject to certain conditions, GPR is typically the best-performing method in our calibration setting, followed by ridge regression and RFR. We also highlight several key limitations of the machine learning methods, which will be crucial to consider in any co-location calibration. In particular, all methods are fundamentally limited in how well they can reproduce pollution levels that lie outside those encountered at the training stage. We find, however, that the linear ridge regression outperforms the non-linear methods in extrapolation settings. GPR can allow for a small degree of extrapolation, whereas RFR can only predict values within the training range. This algorithm-dependent ability to extrapolate is one of the key limiting factors when the calibrated sensors are deployed away from the co-location site itself. Consequently, we find that ridge regression often performs as well as or even better than GPR after sensor relocation. Our results highlight the potential of co-location approaches paired with machine learning calibration techniques to reduce costs of air pollution measurements, subject to careful consideration of the co-location training conditions, the choice of calibration variables and the features of the calibration algorithm.
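
The extrapolation limits described in the abstract can be illustrated with a minimal sketch on synthetic data. It is not the authors' implementation: the data, the coefficients and the helper names `fit_ridge_1d` and `fit_binned_mean` are assumptions made for illustration. The binned-mean predictor is a simple stand-in for the piecewise-constant behaviour of tree-based methods such as RFR; it is not a random forest.

```python
import random
from statistics import mean


def fit_ridge_1d(x, y, alpha=1.0):
    """Closed-form ridge fit for one predictor with an unpenalised intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / (sxx + alpha)  # alpha shrinks the slope towards zero
    intercept = y_bar - slope * x_bar
    return lambda x_new: intercept + slope * x_new


def fit_binned_mean(x, y, n_bins=10):
    """Piecewise-constant predictor: mean of y within equal-width bins of x.

    Hypothetical stand-in for tree-based regression. Inputs outside the
    training range fall into the first or last bin.
    """
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins

    def bin_index(value):
        return min(max(int((value - lo) / width), 0), n_bins - 1)

    sums, counts = [0.0] * n_bins, [0] * n_bins
    for xi, yi in zip(x, y):
        sums[bin_index(xi)] += yi
        counts[bin_index(xi)] += 1
    overall = mean(y)  # fallback for bins without training data
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x_new: means[bin_index(x_new)]


# Synthetic "co-location" data (assumed): raw sensor signal in [0, 20],
# reference concentration linear in the signal plus Gaussian noise.
rng = random.Random(0)
x_train = [rng.uniform(0.0, 20.0) for _ in range(200)]
y_train = [2.0 * x + 5.0 + rng.gauss(0.0, 1.0) for x in x_train]

ridge_predict = fit_ridge_1d(x_train, y_train, alpha=1.0)
binned_predict = fit_binned_mean(x_train, y_train, n_bins=10)

x_new = 40.0  # signal level never seen during training
print("largest training target:", max(y_train))
print("ridge prediction:", ridge_predict(x_new))
print("binned-mean prediction:", binned_predict(x_new))
```

In this toy setting the linear model continues its fitted trend beyond the training range, whereas the piecewise-constant predictor cannot return a value above the largest training target. The sketch only shows the mechanism; it says nothing about the magnitude of the effect for real sensors.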

Highlights

Air pollutants such as nitrogen dioxide (NO2) and particulate matter (PM) have harmful impacts on human health, the ecosystem and public infrastructure (European Environment Agency, 2019).
Each node allows for simultaneous measurement of multiple air pollutants, but we will focus on individual calibrations for NO2 and particle sizes smaller than 10 µm (PM10) here, because these species were of particular interest to our own measurement campaigns.
For co-location measurements, there will be time-dependent fluctuations in the value ranges encountered for the predictors and predictands.

Summary

Introduction

Air pollutants such as nitrogen dioxide (NO2) and particulate matter (PM) have harmful impacts on human health, the ecosystem and public infrastructure (European Environment Agency, 2019). Our focus is on testing the advantages and disadvantages of machine learning calibration techniques for low-cost NO2 and PM10 sensors. The principal idea is to calibrate the sensors through co-location with established high-performance air pollution measurement stations (Fig. 1). Such calibration techniques, if successful, could complement more expensive laboratory-based calibration approaches, thereby further reducing the costs of the overall measurement process. We compare three machine learning regression techniques in the form of ridge regression, random forest regression (RFR) and Gaussian process regression (GPR), and we contrast the results with those obtained with standard multiple linear regression (MLR). We investigate well-known issues concerning site transferability (Masson et al., 2015; Fang and Bate, 2017; Hagan et al., 2018; Malings et al., 2019), i.e. whether a calibration through co-location at one location gives rise to reliable measurements at a different location.
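
The co-location idea can be sketched as follows: fit a calibration on the first part of a co-location period and score it on the held-out remainder using the coefficient of determination (R2). This is a minimal illustration under stated assumptions, not the study's pipeline. The hourly series are synthetic, the sensor response and noise levels are invented, and the single-predictor least-squares fit is the simplest special case of a linear calibration. The names `r2_score` and `fit_line` are hypothetical helpers.

```python
import math
import random
from statistics import mean


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Synthetic hourly co-location data for 30 days (all values assumed):
# a reference series with a daily cycle, and a noisy raw sensor signal.
rng = random.Random(1)
n_hours = 24 * 30
reference = [
    30.0 + 15.0 * math.sin(2.0 * math.pi * hour / 24.0) + rng.gauss(0.0, 5.0)
    for hour in range(n_hours)
]
raw_signal = [0.4 * ref + 3.0 + rng.gauss(0.0, 1.5) for ref in reference]

# Chronological split: train on the first 70 % of hours, test on the rest.
split = int(0.7 * n_hours)
slope, intercept = fit_line(raw_signal[:split], reference[:split])

calibrated = [intercept + slope * s for s in raw_signal[split:]]
r2_out = r2_score(reference[split:], calibrated)
print("out-of-sample R2:", r2_out)
```

A chronological split is used here so that the test hours are never seen during fitting. Evaluating the same fitted calibration on data from a second site would be the analogous check for site transferability.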

Objectives

Methods

Results

Conclusion