Abstract

Robotic detection of people in crowded and cluttered human-centered environments, including hospitals, stores, and airports, is challenging: people can be occluded by other people or objects, and their appearance can deform due to clothing or pose variations. Poor lighting can also cause a loss of discriminative visual features. In this paper, we present a novel multimodal person detection architecture to address the mobile robot problem of person detection under such intraclass variations. Our two-stage training approach uses: 1) a unique pretraining method we term Temporal Invariant Multimodal Contrastive Learning (TimCLR), and 2) a Multimodal YOLOv4 (MYOLOv4) detector for fine-tuning. TimCLR learns person representations that are invariant under intraclass variations through unsupervised learning. Our approach is unique in that it generates image pairs from natural variations within multimodal image sequences and contrasts crossmodal features to transfer invariances between modalities. The pretrained features are then used by the MYOLOv4 detector for fine-tuning and person detection from RGB-D images. Extensive experiments validate the performance of our deep learning architecture in both crowded and cluttered human-centered environments. Results show that our method outperforms existing unimodal and multimodal person detection approaches in detection accuracy under body occlusions and pose deformations in different lighting conditions.
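
To illustrate the crossmodal contrastive idea the abstract describes, the sketch below shows an NT-Xent-style loss between RGB and depth embeddings of temporally paired frames. This is a minimal illustration under stated assumptions, not the paper's exact formulation: the function name crossmodal_nt_xent, the temperature value, and the convention that row i of each batch comes from the same scene instant are all illustrative choices.

```python
# Minimal sketch of a crossmodal contrastive (NT-Xent-style) loss.
# Assumption: z_rgb[i] and z_depth[i] are projected embeddings of the
# same scene instant drawn from a multimodal image sequence, so they
# form a positive pair; all other rows in the batch act as negatives.
import torch
import torch.nn.functional as F

def crossmodal_nt_xent(z_rgb: torch.Tensor, z_depth: torch.Tensor,
                       temperature: float = 0.1) -> torch.Tensor:
    """Contrast RGB embeddings against depth embeddings, (N, D) each."""
    z_rgb = F.normalize(z_rgb, dim=1)      # unit-norm so dot = cosine sim
    z_depth = F.normalize(z_depth, dim=1)
    logits = z_rgb @ z_depth.t() / temperature  # (N, N) similarity matrix
    targets = torch.arange(z_rgb.size(0), device=z_rgb.device)
    # Symmetrize over both retrieval directions: RGB->depth and depth->RGB.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example usage on random embeddings (stand-ins for encoder outputs):
# loss = crossmodal_nt_xent(torch.randn(32, 128), torch.randn(32, 128))
```

Using temporally adjacent multimodal frames as positives, rather than synthetic augmentations alone, is one plausible way to realize the paper's stated goal of learning representations invariant to natural pose, occlusion, and lighting changes while transferring those invariances across modalities.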
