Abstract
The random forest (RF) algorithm is an ensemble of classification or regression trees and is widely used, including for species distribution modelling (SDM). Many researchers use implementations of RF in the R programming language with default parameters to analyse species presence‐only data together with ‘background’ samples. However, there is good evidence that RF with default parameters does not perform well for such ‘presence‐background’ modelling. This is often attributed to the disparity between the number of presence and background samples, also known as ‘class imbalance’, and several solutions have been proposed. Here, we first set the context: the background sample should be large enough to represent all environments in the region. We then aim to understand the drivers of poor performance of RF when models are fitted to presence‐only species data alongside background samples. We show that ‘class overlap’ (where both classes occur in the same environment) is an important driver of poor performance, alongside class imbalance. Class overlap can even degrade performance for presence–absence data. We explain, test and evaluate suggested solutions. Using simulated and real presence‐background data, we compare the performance of default RF with other weighting and sampling approaches. Our results show that RF performance clearly improves when techniques that explicitly manage imbalance are used. We show that these techniques either limit or enforce tree depth. Without compromising the environmental representativeness of the sampled background, we identify approaches to fitting RF that ameliorate the effects of imbalance and overlap and allow excellent predictive performance. Understanding the problems of RF in presence‐background modelling allows new insights into how best to fit models, and should guide future efforts to deal with such data.
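The abstract does not name the specific sampling approaches compared, so the following is only an illustrative sketch of one commonly used idea, assumed here for the sake of example: drawing a balanced bootstrap sample for each tree, so that every tree sees equal numbers of presence and background points while the full background is still covered across the forest. The function name `balanced_bootstrap`, the class sizes, and the number of trees are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def balanced_bootstrap(labels, rng):
    """Return row indices for one tree: equal draws, with replacement, from each class."""
    presence = [i for i, y in enumerate(labels) if y == 1]
    background = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(presence), len(background))  # size of the smaller class
    return rng.choices(presence, k=n) + rng.choices(background, k=n)


# Hypothetical data: 50 presences (1) and 10,000 background points (0).
labels = [1] * 50 + [0] * 10_000
rng = random.Random(42)

# Each tree gets its own balanced sample (500 trees is an arbitrary choice).
per_tree = [balanced_bootstrap(labels, rng) for _ in range(500)]

# Class counts within the first tree's sample.
first_counts = Counter(labels[i] for i in per_tree[0])

# Distinct background points used by at least one tree in the forest.
background_seen = {i for sample in per_tree for i in sample if labels[i] == 0}
```

In this sketch each tree is trained on a balanced sample, yet most of the background points are drawn by at least one tree, which is one way the background sample can stay large and environmentally representative while imbalance is managed at the level of individual trees.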