Abstract
Visual tracking performance has long been limited by the lack of robust appearance models. Existing models fail either when the target's appearance changes rapidly, as in motion-based tracking, or when accurate appearance information is unavailable, as in color camouflage (where background and foreground colors are similar). This paper proposes a robust, adaptive appearance model that works accurately under color camouflage, even in the presence of complex natural objects. The proposed model includes depth as an additional feature in a hierarchical modular neural framework for online object tracking. The model adapts to confusing appearance by identifying the stable depth separation between the target and the surrounding object(s). Depth complements the existing RGB features in scenarios where the RGB features fail to adapt and therefore become unstable over long durations. The parameters of the model are learned efficiently within a deep network consisting of three modules: (1) the spatial attention layer, which discards the majority of the background by selecting a region containing the object of interest; (2) the appearance attention layer, which extracts appearance and spatial information about the tracked object; and (3) the state estimation layer, which enables the framework to predict the future object appearance and location. Three different models were trained and tested to analyze the effect of depth alongside RGB information. In addition, a model that uses depth alone as a standalone input for tracking is proposed. The proposed models were also evaluated in real time using KinectV2 and showed very promising results. The results of our proposed network structures and their comparison with the state-of-the-art RGB tracking model demonstrate that adding depth significantly improves tracking accuracy in more challenging environments (i.e., cluttered and camouflaged scenes). Furthermore, the results of the depth-based models show that depth data alone can provide enough information for accurate tracking, even without RGB information.
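To make the three-module structure concrete, the sketch below outlines one plausible way to wire the pipeline in PyTorch. It is an illustrative assumption, not the authors' published implementation: module names, layer sizes, the (x, y, w, h) box format, and the 4-channel (RGB + depth) input are all hypothetical choices for exposition.

```python
# Hypothetical sketch of the three-module RGB-D tracking pipeline described
# in the abstract; all sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Selects a glimpse region around the predicted object location,
    discarding most of the background before feature extraction."""
    def forward(self, frame, bbox):
        x, y, w, h = [int(v) for v in bbox]
        return frame[..., y:y + h, x:x + w]

class AppearanceAttention(nn.Module):
    """Extracts appearance and spatial features from the glimpse."""
    def __init__(self, in_channels=4, feat_dim=128):  # 4 = RGB + depth
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
    def forward(self, glimpse):
        return self.encoder(glimpse)

class StateEstimation(nn.Module):
    """Recurrent state that predicts the object's next location."""
    def __init__(self, feat_dim=128, hidden=256):
        super().__init__()
        self.rnn = nn.LSTMCell(feat_dim, hidden)
        self.to_bbox = nn.Linear(hidden, 4)  # (x, y, w, h) for next frame
    def forward(self, feat, state):
        h, c = self.rnn(feat, state)
        return self.to_bbox(h), (h, c)

# One tracking step (illustrative): crop, encode, update state, predict box.
spatial, appearance, estimator = SpatialAttention(), AppearanceAttention(), StateEstimation()
frame = torch.randn(1, 4, 240, 320)              # one RGB-D frame
glimpse = spatial(frame, bbox=(100, 60, 64, 64))
bbox_next, state = estimator(appearance(glimpse), None)
```

Under this reading, depth enters simply as a fourth input channel, so the same architecture can be trained with RGB, RGB-D, or depth-only inputs by changing `in_channels`.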
Highlights
Despite recent progress in computer vision driven by deep learning, tracking in a cluttered environment remains a challenging task due to situations such as illumination changes, color camouflage, and the presence of other distractions in the scene. One recent trend in deep learning, attention models, is inspired by the visual perception and cognition system in humans [1] and helps reduce the effect of distractions in the scene, improving tracking accuracy
The results show that adding depth improves the intersection over union (IoU, sketched after these highlights) in all three methods
We showed that adding depth increases tracking accuracy, especially in more challenging environments.
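For reference, IoU here is the standard overlap ratio between the predicted and ground-truth bounding boxes. A minimal sketch, assuming the (x, y, w, h) box format used above:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```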
Summary
Despite recent progress in computer vision driven by deep learning, tracking in a cluttered environment remains a challenging task due to situations such as illumination changes, color camouflage, and the presence of other distractions in the scene. One recent trend in deep learning, attention models, is inspired by the visual perception and cognition system in humans [1] and helps reduce the effect of distractions in the scene, improving tracking accuracy. The human eye has an innate ability to interpret complex scenes with remarkable accuracy in real time: it processes only a subset of the sensory information available to it, which reduces the work required to analyze complex visual scenarios. By making a selective decision about the object of interest, fewer pixels need to be processed and the uninvolved pixels are ignored, which leads to lower complexity and higher tracking accuracy. As a result, this mechanism appears to be key to handling clutter, distractions, and occlusions in target tracking
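As an illustration of this selective mechanism, the sketch below samples a small fixed-size glimpse from a frame with a spatial-transformer-style sampler, so that downstream layers process only the selected pixels. The glimpse size and the normalized box parametrization are assumptions for illustration, not the paper's exact attention model.

```python
import torch
import torch.nn.functional as F

def extract_glimpse(frame, center, size, glimpse_hw=(32, 32)):
    """Differentiably sample a small glimpse from a (1, C, H, W) frame.

    center and size are in normalized [-1, 1] coordinates; only the
    sampled pixels are processed downstream, mirroring the selective
    attention idea described above.
    """
    cx, cy = center
    sx, sy = size
    # Affine transform mapping the glimpse grid into the frame.
    theta = torch.tensor([[[sx, 0.0, cx],
                           [0.0, sy, cy]]], dtype=frame.dtype)
    grid = F.affine_grid(theta, (1, frame.shape[1], *glimpse_hw),
                         align_corners=False)
    return F.grid_sample(frame, grid, align_corners=False)

# A 4-channel RGB-D frame; the tracker processes only the 32x32 glimpse.
frame = torch.randn(1, 4, 240, 320)
glimpse = extract_glimpse(frame, center=(0.1, -0.2), size=(0.25, 0.25))
print(glimpse.shape)  # torch.Size([1, 4, 32, 32])
```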