“How many images do I need?” Understanding how sample size per class affects deep learning model performance metrics for balanced designs in autonomous wildlife monitoring

Saleh Shahinfar,Paul Meek,Greg Falzon

doi:10.1016/j.ecoinf.2020.101085

Saleh Shahinfar, Paul Meek + Show 1 more

Open Access

https://doi.org/10.1016/j.ecoinf.2020.101085

Copy DOI

Abstract

Deep learning (DL) algorithms are the state of the art in automated classification of wildlife camera trap images. The challenge is that the ecologist cannot know in advance how many images per species they need to collect for model training in order to achieve their desired classification accuracy. In fact there is limited empirical evidence in the context of camera trapping to demonstrate that increasing sample size will lead to improved accuracy.In this study we explore in depth the issues of deep learning model performance for progressively increasing per class (species) sample sizes. We also provide ecologists with an approximation formula to estimate how many images per animal species they need for certain accuracy level a priori. This will help ecologists for optimal allocation of resources, work and efficient study design.In order to investigate the effect of number of training images; seven training sets with 10, 20, 50, 150, 500, 1000 images per class were designed. Six deep learning architectures namely ResNet-18, ResNet-50, ResNet-152, DnsNet-121, DnsNet-161, and DnsNet-201 were trained and tested on a common exclusive testing set of 250 images per class. The whole experiment was repeated on three similar datasets from Australia, Africa and North America and the results were compared. Simple regression equations for use by practitioners to approximate model performance metrics are provided. Generalizes additive models (GAM) are shown to be effective in modelling DL performance metrics based on the number of training images per class, tuning scheme and dataset.Overall, our trained models classified images with 0.94 accuracy (ACC), 0.73 precision (PRC), 0.72 true positive rate (TPR), and 0.03 false positive rate (FPR). Variation in model performance metrics among datasets, species and deep learning architectures exist and are shown distinctively in the discussion section. The ordinary least squares regression models explained 57%, 54%, 52%, and 34% of expected variation of ACC, PRC, TPR, and FPR according to number of images available for training. Generalised additive models explained 77%, 69%, 70%, and 53% of deviance for ACC, PRC, TPR, and FPR respectively.Predictive models were developed linking number of training images per class, model, dataset to performance metrics. The ordinary least squares regression and Generalised additive models developed provides a practical toolbox to estimate model performance with respect to different numbers of training images.

Full Text