On the Importance of Training Data Sample Selection in Random Forest Image Classification: A Case Study in Peatland Ecosystem Mapping

Koreen Millard, Murray Richardson

doi:10.3390/rs70708489

Abstract

Random Forest (RF) is a widely used algorithm for classification of remotely sensed data. Through a case study in peatland classification using LiDAR derivatives, we present an analysis of the effects of input data characteristics on RF classifications (including RF out-of-bag error, independent classification accuracy and class proportion error). Training data selection and specific input variables (i.e., image channels) have a large impact on the overall accuracy of the image classification. High-dimension datasets should be reduced so that only uncorrelated important variables are used in classifications. Despite the fact that RF is an ensemble approach, independent error assessments should be used to evaluate RF results, and iterative classifications are recommended to assess the stability of predicted classes. Results are also shown to be highly sensitive to the size of the training data set. In addition to being as large as possible, the training data sets used in RF classification should also be (a) randomly distributed or created in a manner that allows for the class proportions of the training data to be representative of actual class proportions in the landscape; and (b) should have minimal spatial autocorrelation to improve classification results and to mitigate inflated estimates of RF out-of-bag classification accuracy.
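Recommendation (a) can be illustrated with a minimal sketch that draws a training sample whose class shares mirror those of the landscape. This is not the authors' sampling procedure: the class names, counts, and the `proportional_sample` helper are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def proportional_sample(labels, n_train, seed=0):
    """Draw about n_train indices so class shares mirror those in `labels`.

    `labels` is a hypothetical list with one class label per candidate
    reference point; it stands in for the mapped landscape.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    total = len(labels)
    chosen = []
    for label, indices in by_class.items():
        # Allocate points to each class in proportion to its landscape share.
        # Rounding means the total can differ slightly from n_train.
        k = round(n_train * len(indices) / total)
        chosen.extend(rng.sample(indices, min(k, len(indices))))
    return chosen


def class_proportions(labels):
    """Return the share of each class as a fraction of all labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical landscape: 60% bog, 30% fen, 10% upland (made-up values).
landscape = ["bog"] * 600 + ["fen"] * 300 + ["upland"] * 100
train_idx = proportional_sample(landscape, n_train=100)
train_props = class_proportions([landscape[i] for i in train_idx])
print(train_props)
```

Comparing `train_props` with `class_proportions(landscape)` gives a quick check that the training set has not over- or under-represented any class before it is passed to the classifier.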

Highlights

Random Forest (RF) is a widely used algorithm for classification of remotely sensed data
The results of this study demonstrate that RF image classification is highly sensitive to training data characteristics, including sample size, class proportions and spatial autocorrelation
We have demonstrated that the results of RF classification can be inconsistent depending on the input variables and strategy for selecting the training data used in classification
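The abstract's advice to keep only uncorrelated, important variables could be approximated with a greedy filter like the one below. It is a sketch under stated assumptions, not the procedure used in the study: the variable names, sample values, importance scores, and the 0.9 correlation threshold are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(variables, importance, threshold=0.9):
    """Keep variables in order of importance, skipping any that are
    strongly correlated with a variable that has already been kept."""
    kept = []
    for name in sorted(variables, key=importance.get, reverse=True):
        if all(abs(pearson(variables[name], variables[k])) < threshold
               for k in kept):
            kept.append(name)
    return kept


# Hypothetical LiDAR derivatives sampled at six points (made-up values).
variables = {
    "elevation": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "elevation_smoothed": [1.1, 2.0, 2.9, 4.1, 5.0, 6.1],
    "canopy_height": [3.0, 0.5, 4.0, 0.2, 2.5, 1.0],
}
# Hypothetical importance scores, e.g. mean decrease in accuracy.
importance = {
    "elevation": 0.30,
    "elevation_smoothed": 0.25,
    "canopy_height": 0.20,
}

selected = drop_correlated(variables, importance)
print(selected)
```

Here the smoothed elevation layer is nearly a copy of the raw elevation layer, so the filter keeps the more important of the two and retains canopy height, which is only weakly correlated with elevation.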

Summary

Introduction

Random Forest (RF) is a widely used algorithm for classification of remotely sensed data. The input variables (i.e., image channels) are randomly selected for building trees. These characteristics of the algorithm allow RF to produce an accuracy assessment called “out-of-bag” error (rfOOB error) using the withheld training data, as well as measures of variable importance based on the mean decrease in accuracy when a variable is not used in building a tree. When performing image classification and accuracy assessments, training and validation data should be statistically independent (e.g., not clustered) [14] and representative of the entire landscape [10,12], and there should be abundant training data in all classes [15]. Many different training and validation sampling schemes are used throughout the literature, but without careful scrutiny of each dataset used and the specific assessment method, it may be difficult to compare results of classifications [11,16]. Care must be taken to ensure validation points are drawn from a sample independent of training data to avoid optimistic bias [14].
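The "withheld training data" behind the rfOOB error come from the bootstrap sampling used in standard RF: each tree is trained on a sample drawn with replacement, and the points that were never drawn are out-of-bag for that tree. The sketch below simulates only this sampling step, with a hypothetical sample size and number of trees; it does not train a classifier.

```python
import random


def bootstrap_split(n_samples, rng):
    """Return (in_bag, out_of_bag) index lists for one tree."""
    # Sample n_samples indices with replacement, as in bagging.
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    # Points never drawn are withheld from this tree (out-of-bag).
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag


rng = random.Random(42)
n_samples = 1000  # hypothetical number of training points
oob_fractions = []
for _ in range(200):  # 200 hypothetical trees
    _, oob = bootstrap_split(n_samples, rng)
    oob_fractions.append(len(oob) / n_samples)

mean_oob_fraction = sum(oob_fractions) / len(oob_fractions)
print(f"mean out-of-bag fraction: {mean_oob_fraction:.3f}")
```

On average a little over a third of the training points are out-of-bag for any given tree. Because those points come from the same training set, they inherit its clustering: an out-of-bag point that lies next to in-bag points is not an independent test, which is one way to see why rfOOB error can be optimistic and why a separate validation sample is needed.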

Journal: Remote Sensing	Publication Date: Jul 6, 2015
Citations: 427	License type: CC BY 4.0
