Abstract While remote sensing can be an effective tool in building a forest inventory, field measurements and model fitting can be both expensive and challenging. One strategy to reduce forest inventory costs is to leverage forest inventory models fitted to a different population (external models), although the effectiveness of external models is poorly understood. One concern is that models may predict the sample data well but the population poorly, a problem termed ‘overfitting’. The effect of overfitting may be especially problematic when predicting for a different population (a forest area not covered by any sample plots). Assessing overfitting is difficult, and its consequences for estimation are not well understood, especially in the context of prediction using external models. This study assesses how overfitting affects model-assisted forest inventory estimation when using internal and external models. We used field and remotely sensed data (Sentinel-2 images and airborne laser scanning data) from two forest areas in Finland. We evaluated four modeling approaches: ordinary least squares regression (OLS), random forest, k-nearest neighbors, and Gaussian process regression. Both analytical and bootstrap variance estimators were used to evaluate model-assisted estimation performance. Internal models, especially OLS, were the most affected by overfitting, which led to bias in the estimated population means and underestimation of variance. Estimates using external models provided unbiased means and realistic intervals except in the case of deliberate excessive overfitting. The bootstrap variance estimator was more robust to overfitting than the analytical variance estimator for internal models, but was not helpful for external models. Internal models should be parsimonious so that they generalize well to the population and avoid bias. The bootstrap variance estimator is recommended for internal models, especially if there is concern about overfitting.