Ensemble Random Forests as a tool for modeling rare occurrences

ZA Siders, S Martin, D Kobayashi, TT Jones, ND Ducharme-Barth, J Raynor, F Carvalho, RNM Ahrens

doi:10.3354/esr01060

Abstract

Relative to target species, priority conservation species occur rarely in fishery interactions, resulting in imbalanced, overdispersed data. We present Ensemble Random Forests (ERFs) as an intuitive extension of the Random Forest algorithm to handle rare event bias. Each Random Forest receives individual stratified randomly sampled training/test sets, then down-samples the majority class for each decision tree. Results are averaged across Random Forests to generate an ensemble prediction. Through simulation, we show that ERFs outperform Random Forest with and without down-sampling, as well as with the synthetic minority over-sampling technique, for highly class imbalanced to balanced datasets. Spatial covariance greatly impacts ERFs’ perceived performance, as shown through simulation and case studies. In case studies from the Hawaii deep-set longline fishery, giant manta ray Mobula birostris syn. Manta birostris and scalloped hammerhead Sphyrna lewini presence had high spatial covariance and high model test performance, while false killer whale Pseudorca crassidens had low spatial covariance and low model test performance. Overall, we find ERFs have 4 advantages: (1) reduced successive partitioning effects; (2) prediction uncertainty propagation; (3) better accounting for interacting covariates through balancing; and (4) minimization of false positives, as the majority of Random Forests within the ensemble vote correctly. As ERFs can readily mitigate rare event bias without requiring large presence sample sizes or imparting considerable balancing bias, they are likely to be a valuable tool in bycatch and species distribution modeling, as well as spatial conservation planning, especially for protected species where presence can be rare.
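The sampling scheme described in the abstract can be illustrated with a minimal Python sketch. It covers only the index bookkeeping: a stratified train/test split for each forest, a class-balanced sample for each tree, and averaging of predictions across forests. Fitting the trees themselves is omitted. The function names, the equal class sizes, and sampling with replacement are assumptions made for illustration, not details taken from the paper.

```python
import random
from statistics import mean


def stratified_split(labels, test_frac, rng):
    """Split indices into train/test sets, preserving class proportions.

    Each forest in the ensemble would call this with its own random draw.
    """
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = max(1, round(test_frac * len(idx)))  # keep at least one per class
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return train, test


def downsampled_bootstrap(train_idx, labels, rng):
    """Draw one tree's sample: bootstrap the minority class, then draw the
    same number of cases from the majority class (assumed 1:1 balance)."""
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    n = len(minority)
    return rng.choices(minority, k=n) + rng.choices(majority, k=n)


def ensemble_predict(per_forest_probs):
    """Average predicted presence probabilities across forests,
    one value per observation."""
    return [mean(column) for column in zip(*per_forest_probs)]


if __name__ == "__main__":
    rng = random.Random(0)
    labels = [1] * 5 + [0] * 95          # toy data: 5% presences
    train, test = stratified_split(labels, 0.2, rng)
    tree_sample = downsampled_bootstrap(train, labels, rng)
    print(len(train), len(test), len(tree_sample))
```

Because every forest sees a different stratified split and every tree a different balanced sample, the ensemble mean reflects variation in both, which is one plausible reading of the uncertainty propagation the abstract mentions.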

Highlights

Machine learning algorithms have proven to be a ubiquitous tool for modeling species distributions
The Ensemble Random Forests (ERFs) approach performed as well as or slightly better than the Random Forest with down-sampling (RF-DS) and RF with synthetic minority over-sampling technique (RF-SMOTE) approaches on the area under the curve (AUC), root mean squared error (RMSE), and true skill statistic (TSS) performance metrics across the range of detection probabilities, for both the external holdout dataset (Fig. 2A–C) and the full dataset (Fig. 2D–F)
The RF approach outperformed all other approaches for the RMSE performance metric for all detection probabilities that resulted in class imbalance (δ ≥ 0.25) (Fig. 2B,E)
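The three performance metrics named in the highlights have standard definitions, sketched below in plain Python. How the paper computed them is not stated here, so the 0.5 classification threshold for TSS and the pairwise (rank-based) form of AUC are assumptions made for illustration.

```python
from math import sqrt


def rmse(y_true, y_prob):
    """Root mean squared error between 0/1 outcomes and predicted probabilities."""
    return sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true))


def true_skill_statistic(y_true, y_prob, threshold=0.5):
    """TSS = sensitivity + specificity - 1 at a given threshold (assumed 0.5).

    Requires at least one presence and one absence in y_true.
    """
    tp = fn = tn = fp = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1


def auc(y_true, y_prob):
    """Area under the ROC curve as the probability that a random presence
    scores higher than a random absence (ties count one half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    y_true = [1, 1, 0, 0]
    y_prob = [0.9, 0.4, 0.6, 0.1]
    print(auc(y_true, y_prob), rmse(y_true, y_prob), true_skill_statistic(y_true, y_prob))
```

Note that RMSE rewards predictions close to the observed 0/1 values, so on heavily imbalanced data a model that predicts near zero everywhere can score a low RMSE while having little skill at identifying presences. AUC and TSS are less sensitive to this, which is worth keeping in mind when comparing the metrics.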

Summary

INTRODUCTION

Machine learning algorithms have proven to be a ubiquitous tool for modeling species distributions. A variety of methods have been used in ecology to deal with overdispersion (e.g. zero-inflated, hurdle, or delta-generalized linear models), most of which implement a mixture model in some fashion (Zuur et al. 2009, Campbell 2015, Stock et al. 2020). While these models can be well suited to modeling rare event data, large sample sizes are often needed to overcome the low proportion of presences and estimate model parameters with adequate precision (He & Garcia 2009, Rodriguez-Galiano et al. 2012). [...] Manta birostris (White et al. 2018) (Threatened), scalloped hammerhead Sphyrna lewini (Endangered), and false killer whale Pseudorca crassidens (Endangered). The purpose of the simulations and case studies was to demonstrate the effects of sample size and sample covariation on model performance and the utility of ERFs as a bycatch and species distribution modeling tool.

Background

Implementation

Performance

Simulated covariates

Titration simulation

Spatial contrast simulation

Fisheries-dependent data

Covariates

Titration case study

Spatial contrast case studies

DISCUSSION


Journal: Endangered Species Research
Publication Date: Oct 8, 2020
Citations: 16
License type: CC BY 4.0


