Supervised learning allows the prediction of variables measured in situ from variables that can be measured from satellites. A labeled data set for this purpose is typically created by matching in-situ and satellite data and split into subsets for model training and initial validation. However, the available data are often not randomly distributed in space and time. In theory, this can bias estimates of prediction errors. Here, remote sensing of chlorophyll a in the Baltic Sea serves as an example to demonstrate the importance of this problem in marine remote sensing and to test how well different statistical designs for validation mitigate it. Semi-synthetic data sets were created by combining daily chlorophyll a fields from a biogeochemical model hindcast with real-world locations and times of in-situ measurements, generated by sampling 2,000 combinations of cruises from an oceanographic database. These data sets were matched with co-located satellite data and used to train and validate four algorithms using remote sensing reflectances as input. The algorithms were validated with different methods, including random hold-out sets and various block cross-validation designs based on geographical location, time, or location in predictor space. The resulting error estimates were compared to true errors calculated from differences from the biogeochemical model outputs serving as the response variable. All validation methods underestimated prediction errors, in many cases by >30%. While a simple band-ratio algorithm had the smallest true errors (e.g., absolute percentage difference, APD = 50%), estimated errors were smallest for more complicated, and in fact less accurate, machine learning algorithms. For example, 10-fold cross-validation led to selection of the truly best algorithm among the four candidates for <10% of data sets. The biases were smallest, but not absent, for spatial block cross-validation, which selected the truly best algorithm for 21–46% of data sets, depending on the error measure. When the analyses were repeated with data that were randomly distributed in space and time, the biases of error estimates based on random splits became much smaller (e.g., 10-fold cross-validation estimated errors within 2% of their true values and selected the truly best algorithm for >99% of data sets), spatial block cross-validation overestimated prediction errors (often by >40%), all algorithms achieved lower true errors, and a random forest made the most accurate predictions overall (APD = 27%). These results show that more attention should be paid to statistical methods for estimating the errors of supervised learning algorithms, e.g., by using multiple validation methods in combination and by critically discussing error estimates with respect to dependence, representativeness, and stationarity. Furthermore, a non-random spatiotemporal distribution of labeled data can be a barrier to harnessing the full potential of machine learning algorithms in marine remote sensing.
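The central methodological contrast of the study, random k-fold cross-validation versus spatial block cross-validation, can be sketched in a few lines of code. The following is a minimal, self-contained illustration, not the paper's actual pipeline: the synthetic matchup data, the 2-degree blocking grid, the choice of a random forest, and the exact APD definition are all illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, KFold

# --- Synthetic stand-in for a satellite/in-situ matchup data set (illustrative only) ---
rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 5))                      # stand-in remote sensing reflectances
y = rng.lognormal(mean=0.0, sigma=1.0, size=n)   # stand-in chlorophyll-a (positive values)
lat = rng.uniform(54.0, 66.0, size=n)            # roughly Baltic Sea latitudes
lon = rng.uniform(10.0, 30.0, size=n)            # roughly Baltic Sea longitudes

# Assign each matchup to a 2-degree grid cell; the cells act as spatial blocks.
blocks = np.floor(lat / 2.0).astype(int) * 1000 + np.floor(lon / 2.0).astype(int)

def apd(y_true, y_pred):
    """Mean absolute percentage difference (one plausible definition of APD)."""
    return 100.0 * np.mean(np.abs(y_pred - y_true) / y_true)

def cv_error(splitter, groups=None):
    """Average APD over the cross-validation folds produced by `splitter`."""
    errors = []
    for train_idx, test_idx in splitter.split(X, y, groups):
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(X[train_idx], y[train_idx])
        errors.append(apd(y[test_idx], model.predict(X[test_idx])))
    return float(np.mean(errors))

# Random 10-fold CV ignores spatial dependence; block CV holds out whole regions.
print("random 10-fold CV, APD:", cv_error(KFold(n_splits=10, shuffle=True, random_state=0)))
print("spatial block CV,  APD:", cv_error(GroupKFold(n_splits=5), groups=blocks))
```

With the independent synthetic data generated above, the two estimates will be similar; the gap between them opens up precisely when matchups cluster in space and time, which is the situation the study quantifies.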