FINE-TUNING DEEP LEARNING MODELS FOR PEDESTRIAN DETECTION

Caisse Amisse, Mario Ernesto Jijón-Palma, Jorge Antonio Silva Centeno

doi:10.1590/s1982-21702021000200013

Open Access

Journal: Boletim De Ciencias Geodesicas
Publication Date: Jan 1, 2021
Citations: 8
License type: CC BY 4.0

Affiliation: Federal University of Paraná

Abstract

Object detection in high-resolution images is a new challenge that the remote sensing community is facing thanks to the introduction of unmanned aerial vehicles and monitoring cameras. One of the interests is to detect and track persons in the images. Unlike general objects, pedestrians can take different poses and undergo constant morphological changes while moving, so this task needs an intelligent solution. Fine-tuning has attracted great interest among researchers because of its relevance for retraining convolutional networks for many interesting applications. For object classification, detection, and segmentation, fine-tuned models have shown state-of-the-art performance. In the present work, we evaluate the performance of fine-tuned models under varying amounts of training data by comparing Faster Region-based Convolutional Neural Network (Faster R-CNN) Inception v2, Single Shot MultiBox Detector (SSD) Inception v2, and SSD Mobilenet v2. To this end, the effect of varying the training data on performance metrics such as accuracy, precision, F1-score, and recall is taken into account. After testing the detectors, we found that precision and recall are more sensitive to variation in the amount of training data. Across five variations of the amount of training data, we observe that proportions of 60%-80% consistently achieve highly comparable performance. In all variations of the training data, Faster R-CNN Inception v2 outperforms SSD Inception v2 and SSD Mobilenet v2 in the evaluated metrics, but the SSD models converge relatively quickly during the training phase. Overall, partitioning 80% of the total data for fine-tuning trained models produces efficient detectors even with only 700 data samples.
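As a rough illustration of the metrics named in the abstract, the following Python sketch computes precision, recall, F1-score, and accuracy from detection counts, and the number of training samples for a given split fraction. The class and function names are hypothetical, the counts are made-up numbers rather than results from the paper, and the accuracy formula shown is the standard confusion-matrix definition, which may differ from the exact definition the authors used.

```python
from dataclasses import dataclass


@dataclass
class DetectionCounts:
    """Hypothetical container for detection outcomes on a test set."""
    tp: int      # pedestrians correctly detected
    fp: int      # detections that match no pedestrian
    fn: int      # pedestrians that were missed
    tn: int = 0  # often undefined for detectors; assumed 0 unless given


def precision(c: DetectionCounts) -> float:
    """Fraction of detections that are real pedestrians."""
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def recall(c: DetectionCounts) -> float:
    """Fraction of real pedestrians that were detected."""
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def f1_score(c: DetectionCounts) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def accuracy(c: DetectionCounts) -> float:
    """Standard definition: correct outcomes over all outcomes (assumption)."""
    total = c.tp + c.tn + c.fp + c.fn
    return (c.tp + c.tn) / total if total else 0.0


def train_size(total_samples: int, fraction: float) -> int:
    """Number of samples assigned to training for a given split fraction."""
    return round(total_samples * fraction)


if __name__ == "__main__":
    # Illustrative counts only; these are not results from the paper.
    counts = DetectionCounts(tp=80, fp=20, fn=20)
    print(f"precision={precision(counts):.3f} recall={recall(counts):.3f}")
    print(f"f1={f1_score(counts):.3f} accuracy={accuracy(counts):.3f}")
    # An 80% split of the 700 samples mentioned in the abstract.
    print(train_size(700, 0.8))  # 560
```

Precision and recall share the true-positive count but divide by different totals, so a change in false positives moves only precision and a change in missed pedestrians moves only recall.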

Highlights

The availability of large numbers of image sequences obtained using security cameras, video cameras installed on unmanned aerial vehicles, or low-cost imaging sensors such as smartphones has opened a large series of new applications to the remote sensing community.
One solution to this problem is the use of artificial intelligence techniques such as convolutional neural networks (CNNs), which have the disadvantage of requiring a large number of training samples and considerable computational effort, generally using Graphics Processing Units (GPUs) to speed up the process, to achieve the desired performance.
It can be seen from this table that the Faster R-CNN Inception v2 model offers the best performance in pedestrian detection, namely high precision, recall, accuracy, and F1-score in almost all experiments.

Summary

Introduction

The availability of large numbers of image sequences obtained using security cameras, video cameras installed on unmanned aerial vehicles (drones), or low-cost imaging sensors such as smartphones has opened a large series of new applications to the remote sensing community. One of them is the possibility of detecting and tracking persons in urban scenes for security purposes. Pedestrian detection in video sequences is a challenging problem because the appearance of the pedestrian changes from image to image along the scene. A flexible model and high computational effort are necessary to perform this task accurately. One solution to this problem is the use of artificial intelligence techniques such as convolutional neural networks (CNNs), which have the disadvantage of requiring a large number of training samples and considerable computational effort, generally using Graphics Processing Units (GPUs) to speed up the process, to achieve the desired performance. Gathering and labelling large datasets and training a network for each specific task is impractical and time-consuming.

Methods

Results

Conclusion