Embedded Descriptor Generation in Faster R-CNN for Multi-Object Tracking

Younis Younis, Khalil Alsaif

doi:10.33899/csmj.2021.170013

Abstract

With the rapid growth in the use of computers to extract knowledge from large volumes of information, such as video files, multi-object detection and tracking have attracted significant attention. Artificial Neural Networks (ANNs), and the Faster R-CNN network in particular, have shown outstanding performance in multi-object detection. This study proposes a new multi-object tracking method based on descriptors generated by a neural network embedded in the Faster R-CNN. The embedding allows the method to directly output a descriptor for each object detected by the Faster R-CNN, computed from the same features the network uses to detect that object. Because these features are already computed for detection and have proven effective at that stage, the descriptors are produced rapidly and accurately. The descriptors are then clustered into a number of clusters equal to the number of objects detected in the first frame of the video. For subsequent frames, the number of clusters is increased until the distance between the centroid of the newly created cluster and its nearest centroid is less than the average distance among the centroids. Newly added clusters are treated as new objects, whereas older clusters are kept in case the object reappears in the video. The proposed method is evaluated on the UA-DETRAC (University at Albany Detection and Tracking) dataset and achieves 64.8% MOTA and 83.6% MOTP at a processing speed of 127.3 frames per second.
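The cluster-growth rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of plain k-means (Lloyd's algorithm), the choice to seed each candidate cluster with the descriptor farthest from its nearest centroid, the choice to measure the average distance over the centroids that existed before the candidate was added, and the decision to discard the candidate that triggers the stopping rule are all assumptions made for illustration. Descriptors are represented as plain tuples of floats.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))


def kmeans(points, centroids, iters=20):
    """Lloyd's algorithm from the given starting centroids.

    A cluster that receives no points keeps its old centroid, so clusters
    of objects absent from the current frame are preserved.
    """
    centroids = [tuple(c) for c in centroids]
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        updated = [mean(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def average_centroid_distance(centroids):
    """Mean pairwise Euclidean distance among at least two centroids."""
    pairs = [(a, b) for i, a in enumerate(centroids) for b in centroids[i + 1:]]
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def grow_clusters(points, centroids):
    """Add clusters until a new centroid lands too close to an existing one.

    Assumption: the candidate cluster that triggers the stopping rule is
    discarded, and the surviving centroids are refined on the new points.
    """
    centroids = [tuple(c) for c in centroids]
    if len(centroids) < 2:
        raise ValueError("need at least two centroids to average distances")
    max_clusters = len(centroids) + len(points)  # guard against endless growth
    while len(centroids) < max_clusters:
        # Assumed seeding: the descriptor farthest from its nearest centroid.
        seed = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        candidate = kmeans(points, centroids + [seed])
        new_centroid = candidate[-1]
        nearest = min(math.dist(new_centroid, c) for c in candidate[:-1])
        if nearest < average_centroid_distance(centroids):
            break  # the candidate is not distinct enough to be a new object
        centroids = candidate
    return kmeans(points, centroids)
```

With two initial centroids and a later frame that contains a tight group of descriptors far from both, `grow_clusters` would be expected to return three centroids; a frame that contains only descriptors near the existing centroids leaves the count unchanged.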

Highlights

With the rapidly growing use of computers to automate different types of applications, object detection and tracking techniques have attracted significant attention, owing to the enormous number of digital videos captured daily and the huge amount of information they contain.
In addition to improving both the Multi-Object Tracking Accuracy (MOTA) and the Multi-Object Tracking Precision (MOTP) compared to existing techniques, the proposed method shows a significant improvement over the use of the same neural network with the MHT method for object tracking.
This concept has been the backbone of the Faster R-CNN, which has improved the performance of the Fast R-CNN by integrating the Region Proposal Network (RPN) in the same neural network and by processing the same inputs collected by the Fast R-CNN.
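For readers unfamiliar with the MOTA metric named in the highlights, the sketch below computes it using the widely used CLEAR MOT definition: one minus the ratio of total errors (missed objects, false positives, and identity switches) to the total number of ground-truth objects. The source does not state the exact evaluation protocol used, so this is offered only as the common definition; the function name and the per-frame counts in the usage lines are hypothetical.

```python
def mota(false_negatives, false_positives, id_switches, ground_truth):
    """MOTA under the CLEAR MOT definition, from per-frame count lists.

    Each argument is a sequence with one integer per frame.
    """
    total_gt = sum(ground_truth)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    errors = sum(false_negatives) + sum(false_positives) + sum(id_switches)
    return 1.0 - errors / total_gt


# Hypothetical counts for a two-frame clip with five objects per frame.
score = mota(false_negatives=[1, 0], false_positives=[0, 1],
             id_switches=[0, 0], ground_truth=[5, 5])
print(f"MOTA = {score:.1%}")
```

Because errors are summed over all frames before dividing, the score can become negative when a tracker makes more errors than there are ground-truth objects.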

Summary

Introduction

With the rapidly growing use of computers to automate different types of applications, object detection and tracking techniques have attracted significant attention, owing to the enormous number of digital videos captured daily and the huge amount of information they contain. The proposed method embeds a neural network in the Faster R-CNN architecture to generate descriptors that represent the detected objects. These descriptors are clustered into groups in order to predict whether each detected object was already detected in a previous frame or is new. The remaining descriptors are assigned to the corresponding objects that are detected in the same position by the Faster R-CNN neural network. These descriptors are clustered in order to track the objects across the different frames of the video.

Tracking Method

Method

Findings

Conclusion