Estimating Brazilian Amazon Canopy Height Using Landsat Reflectance Products in a Random Forest Model with Lidar as Reference Data

Pedro V C Oliveira, Hankui K Zhang, Xiaoyang Zhang

doi:10.3390/rs16142571

Abstract

Landsat data have been used to derive forest canopy structure, height, and volume with machine learning models (i.e., models that learn from data to make decisions and predictions without being explicitly programmed), trained on ground measurements or airborne lidar. This study explored the potential of using Landsat reflectance and airborne lidar data as training data to estimate canopy heights in the Brazilian Amazon forest, and examined how the processing level of the Landsat reflectance product and the spatial autocorrelation among samples affect random forest modeling. Specifically, this study assessed how the accuracy of canopy height predictions from random forest regression models is affected by three different Landsat 8 reflectance product inputs (i.e., USGS level 1 top of atmosphere reflectance, USGS level 2 surface reflectance, and NASA nadir bidirectional reflectance distribution function (BRDF) adjusted reflectance (NBAR)), sample sizes, training/test split strategies, and the use of geographic coordinates as predictors. In the random forest regression models, the dependent variable (i.e., the response variable) was the dominant canopy height at 90 m resolution derived from airborne lidar data, while the independent variables (i.e., the predictor variables) were the temporal metrics extracted from each Landsat reflectance product. The results indicated that the choice of Landsat reflectance product affected model accuracy, with NBAR data yielding more trustworthy results than the other products despite having higher RMSE values. The training and test split strategy also affected the derived model accuracy metrics, with the random sample split (randomly distributed training and test samples) showing inflated accuracy compared to the spatial split (training and test samples spatially set apart). This inflation was induced by the spatial autocorrelation between training and test data in the random split.
The inclusion of geographic coordinates as independent variables improved model accuracy in the random split strategy but not in the spatial split, where training and test samples had different geographic coordinate ranges. The study highlighted the importance of data processing levels and the training and test split methods in random forest modeling of canopy height.
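
The difference between the two split strategies can be illustrated with a minimal Python sketch. This is not the authors' code: the function names (`random_split`, `spatial_split`, `x_range`, `ranges_overlap`), the synthetic 20 x 10 grid of 90 m cells, the 30% test fraction, and the easting threshold are all assumptions made for illustration. It shows only the sampling geometry, not the random forest model itself.

```python
import random

def random_split(samples, test_fraction=0.3, seed=0):
    """Randomly assign samples to training and test sets (hypothetical helper)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def spatial_split(samples, x_cut):
    """Hold out every sample at or east of x_cut as the test set (hypothetical helper)."""
    train = [s for s in samples if s["x"] < x_cut]
    test = [s for s in samples if s["x"] >= x_cut]
    return train, test

def x_range(samples):
    """Return the (min, max) easting covered by a set of samples."""
    xs = [s["x"] for s in samples]
    return min(xs), max(xs)

def ranges_overlap(a, b):
    """True if two closed intervals (min, max) share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]

# Synthetic sample locations: a 20-column by 10-row grid of 90 m cells (assumed layout).
samples = [{"x": col * 90.0, "y": row * 90.0} for col in range(20) for row in range(10)]

rand_train, rand_test = random_split(samples)
sp_train, sp_test = spatial_split(samples, x_cut=14 * 90.0)

# In the random split, test cells are interleaved with training cells, so a
# test cell usually has a spatially autocorrelated training neighbour.
print("random split easting ranges:", x_range(rand_train), x_range(rand_test))

# In the spatial split, the two sets cover disjoint easting ranges, which is
# also why coordinate predictors learned in training do not transfer to the test area.
print("spatial split easting ranges:", x_range(sp_train), x_range(sp_test))
```

In this sketch the spatial split separates the sets along a single easting threshold; a real study area might instead be split by blocks, tiles, or lidar transects, which the abstract does not specify.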

Full Text