Abstract

Video surveillance has become ubiquitous with the rapid development of artificial intelligence. Multi-object detection (MOD) is a key step in video surveillance and has been widely studied for a long time. The majority of existing MOD algorithms follow the “divide and conquer” pipeline and use popular machine learning techniques to optimize algorithm parameters. However, this pipeline is usually suboptimal because it decomposes the MOD task into several sub-tasks and does not optimize them jointly. In addition, the frequently used supervised learning methods rely on labeled data, which are scarce and expensive to obtain. We therefore propose an end-to-end Unsupervised Multi-Object Detection framework for video surveillance, in which a neural model learns to detect objects in each video frame by minimizing the image reconstruction error. Moreover, we propose a Memory-Based Recurrent Attention Network to ease detection and training. The proposed model was evaluated on both synthetic and real datasets, demonstrating its potential.

Highlights

  • Video surveillance aims to analyze video data recorded by cameras

  • Classical methods such as Deformable Part Models (DPMs) [1] follow the “divide and conquer” pipeline: a sliding window approach is first used to generate image regions, a classifier is then employed to categorize each region as object or non-object, and post-processing is finally applied to refine the bounding boxes of object regions. To improve both the efficiency and performance of multi-object detection (MOD), methods based on Region-based Convolutional Neural Networks (R-CNNs) [3,4,5,6] have been proposed and perform well on various popular object detection datasets [7,8,9,10,11]

  • To quantitatively assess the model, we evaluated different configurations with the commonly used MOD metrics, including the Average Precision (AP) [9], Multi-Object Detection Accuracy (MODA), Multi-Object Detection Precision (MODP) [28], average False Alarm number per Frame (FAF), total True Positive number (TP), total False Positive number (FP), total False Negative number (FN), Precision (TP/(TP + FP)), and Recall (TP/(TP + FN))
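
The count-based metrics listed above can be illustrated with a minimal Python sketch. Precision and Recall follow the formulas given in the text. The FAF and MODA expressions are assumptions, since the text does not spell them out: FAF is taken as FP divided by the number of frames, and MODA uses the common CLEAR-style form with unit costs, 1 - (FN + FP) / (number of ground-truth objects). The names `DetectionCounts` and `summarize` are hypothetical, and AP and MODP are omitted because they need per-detection scores and overlaps rather than totals.

```python
from dataclasses import dataclass


@dataclass
class DetectionCounts:
    tp: int          # total true positives over all frames
    fp: int          # total false positives over all frames
    fn: int          # total false negatives (missed objects) over all frames
    num_frames: int  # number of evaluated frames


def summarize(c: DetectionCounts) -> dict:
    """Compute count-based detection metrics from dataset-level totals."""
    # Every ground-truth object is either matched (TP) or missed (FN).
    num_gt = c.tp + c.fn
    num_det = c.tp + c.fp
    return {
        # Precision = TP / (TP + FP), as stated in the text.
        "precision": c.tp / num_det if num_det else 0.0,
        # Recall = TP / (TP + FN), as stated in the text.
        "recall": c.tp / num_gt if num_gt else 0.0,
        # Assumed definition: false alarms averaged over frames.
        "faf": c.fp / c.num_frames if c.num_frames else 0.0,
        # Assumed CLEAR-style definition with unit miss/false-alarm costs.
        "moda": 1.0 - (c.fn + c.fp) / num_gt if num_gt else 0.0,
    }
```

For instance, 80 true positives, 20 false positives, and 20 misses over 10 frames give a precision and recall of 0.8, an FAF of 2.0, and a MODA of 0.6 under these definitions.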

Summary

Introduction

Video surveillance aims to analyze video data recorded by cameras. It has been widely used in crime prevention, industrial processes, traffic monitoring, sporting events, etc. Multi-object detection (MOD) from visual data has been extensively studied for many years by the computer vision community. Classical methods such as Deformable Part Models (DPMs) [1] follow the “divide and conquer” pipeline: a sliding window approach is first used to generate image regions, a classifier (e.g., a Support Vector Machine [2]) is then employed to categorize each region as object or non-object, and post-processing is finally applied to refine the bounding boxes of object regions (e.g., removing outliers, merging duplicates, and rectifying boundaries). We assess the proposed model on both a synthetic dataset (Sprites) and a real dataset (DukeMTMC [17]), demonstrating its advantages and practicality.
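
The three-stage “divide and conquer” pipeline described above can be sketched in Python as follows. This is only an illustration of the staged structure, not the DPM method or the proposed model: the region classifier is replaced by a hypothetical stand-in (`score_region`, the mean pixel intensity of the window), post-processing is reduced to greedy non-maximum suppression for merging duplicates, and all function names, window sizes, and thresholds are assumptions.

```python
import numpy as np


def sliding_windows(height, width, size, stride):
    """Stage 1: yield square (x1, y1, x2, y2) regions on a regular grid."""
    for y in range(0, height - size + 1, stride):
        for x in range(0, width - size + 1, stride):
            yield (x, y, x + size, y + size)


def score_region(image, box):
    """Stage 2 stand-in (hypothetical): mean intensity as an object score."""
    x1, y1, x2, y2 = box
    return float(image[y1:y2, x1:x2].mean())


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def detect(image, size=8, stride=4, score_thresh=0.5, iou_thresh=0.3):
    """Run the three stages independently, one after another."""
    h, w = image.shape
    # Stages 1 and 2: propose regions, keep those scored as "object".
    scored = [(score_region(image, b), b) for b in sliding_windows(h, w, size, stride)]
    candidates = sorted((sb for sb in scored if sb[0] >= score_thresh), reverse=True)
    # Stage 3: greedy non-maximum suppression removes overlapping duplicates.
    kept = []
    for _, box in candidates:
        if all(iou(box, k) <= iou_thresh for k in kept):
            kept.append(box)
    return kept
```

Each stage here has its own hand-set parameters (`stride`, `score_thresh`, `iou_thresh`) and none is tuned with respect to the final detection quality, which is the kind of non-joint optimization the text identifies as a weakness of this pipeline.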

Unsupervised Multi-Object Detection
Image Encoder
Recurrent Object Detector
Renderer
Memory-Based Recurrent Attention Networks
Experiments
Sprites
DukeMTMC
Visualizing the UMOD-MRAN
Related Work
Conclusions