Abstract

Automatic visual scene understanding is one of the ultimate goals of computer vision and has been a focus of the field since its early days. Despite continuous effort over many years, applications such as autonomous driving and robotics remain unsolved and are the subject of active research. In recent years, improved probabilistic methods have become a popular tool in state-of-the-art computer vision algorithms. At the same time, high-resolution digital imaging devices and increased computational power have become available. By leveraging these methodological and technical advances, current methods obtain encouraging results in well-defined environments for robust object class detection, tracking, and pixel-wise semantic scene labeling, and they give rise to renewed hope for further progress in scene understanding in real environments. This thesis improves the state of the art in scene understanding with monocular cameras and aims at applications on mobile platforms such as service robots or driver assistance systems for automotive safety. It develops and improves approaches for object class detection and semantic scene labeling, and integrates them into models for global scene reasoning that exploit context at different levels.

To enhance object class detection, we perform a thorough evaluation of people and pedestrian detection within the popular sliding window framework (a minimal sketch of this framework follows the abstract). In particular, we address pedestrian detection from a moving camera and provide new benchmark datasets for this task. As frequently used single-window metrics can fail to predict algorithm performance, we argue for application-driven, image-based evaluation metrics, which allow a better assessment of the overall system. We propose and analyze features, and combinations thereof, based on visual and motion cues. Detection performance is evaluated systematically for different feature-classifier combinations, which is crucial to obtain the best results. Our results indicate that combining complementary cues improves performance. Despite camera ego-motion, we obtain significantly better detection results with motion-enhanced pedestrian detectors.

Realistic onboard applications demand real-time processing with frame rates of 10 Hz and higher. In this thesis we propose to exploit parallelism in order to achieve the required runtime performance for sliding window object detection. In a case study, we employ commodity graphics hardware for the popular histograms of oriented gradients (HOG) detection approach and achieve a significant speed-up compared to a baseline CPU implementation.

Furthermore, we propose an integrated dynamic conditional random field model for joint semantic scene labeling and object detection in highly dynamic scenes (the general form of such a model is sketched after this abstract). Our model improves semantic context modeling and fuses low-level filter bank responses with more global object detections. Recognition performance increases for object as well as scene classes. Integration over time needs to account for the different dynamics of object and scene classes, but yields more robust results.

Finally, we propose a probabilistic 3D scene model that encompasses multi-class object detection, object tracking, scene labeling, and 3D geometric relations. This integrated 3D model is able to represent complex interactions such as inter-object occlusion, physical exclusion between objects, and geometric context. Inference in this model makes it possible to recover 3D scene context and to perform 3D multi-object tracking from a mobile observer, for objects of multiple categories, using only monocular video as input.
Our results indicate that our joint scene-tracklet model, which accumulates evidence over multiple frames, substantially improves performance. All experiments throughout this thesis are performed on challenging real-world data. We contribute several datasets that were recorded from moving cars in urban and suburban environments. Highly dynamic scenes are captured while driving in normal traffic on rural roads. Our experiments support the conclusion that joint models, which integrate semantic scene labeling, object detection, and tracking, are well suited to improving the performance of the individual stand-alone tasks.
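
To make the sliding window framework mentioned in the abstract concrete, here is a minimal sketch in Python. It is an illustrative assumption rather than the thesis's implementation: the function names, the placeholder feature extractor, the window size, and the threshold are hypothetical; the detectors discussed in the thesis use HOG-like descriptors and additionally scan an image pyramid over scales and apply non-maximum suppression.

```python
# Minimal sliding-window detection sketch (illustrative assumption, not the
# thesis's implementation). A linear classifier (w, b) scores features
# extracted from each window; windows scoring above a threshold are kept.
import numpy as np

def extract_features(window):
    # Placeholder feature extractor: real detectors use HOG-like gradient
    # histograms; here we simply flatten pixel intensities.
    return window.reshape(-1).astype(np.float32)

def sliding_window_detect(image, w, b, win_shape=(128, 64), stride=8, threshold=0.0):
    # Scan the image with a fixed-size window at a fixed stride and return
    # (x, y, width, height, score) tuples for windows above the threshold.
    win_h, win_w = win_shape
    detections = []
    for y in range(0, image.shape[0] - win_h + 1, stride):
        for x in range(0, image.shape[1] - win_w + 1, stride):
            feat = extract_features(image[y:y + win_h, x:x + win_w])
            score = float(np.dot(w, feat) + b)  # linear SVM-style score
            if score > threshold:
                detections.append((x, y, win_w, win_h, score))
    return detections

# Usage example on random data; a real pipeline would also search over
# scales (image pyramid) and merge overlapping detections afterwards.
rng = np.random.default_rng(0)
image = rng.random((240, 320))
w = rng.standard_normal(128 * 64).astype(np.float32)
hits = sliding_window_detect(image, w, b=-0.5, threshold=50.0)
```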
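The dynamic conditional random field mentioned in the abstract builds on the standard CRF formulation. As a hedged illustration of the general form only (the thesis's model additionally couples object and scene-label layers and links them over time, which is not shown here), a pairwise CRF over labels y given image evidence x can be written as:

```latex
% Generic pairwise CRF (illustrative form only, not the thesis's exact model)
P(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}
  \exp\!\Bigl( -\sum_{i \in \mathcal{V}} \psi_i(y_i \mid \mathbf{x})
               \;-\; \sum_{(i,j) \in \mathcal{E}} \psi_{ij}(y_i, y_j \mid \mathbf{x}) \Bigr)
```

Here the unary potentials \psi_i could fuse low-level filter-bank responses with object-detector evidence at each site, the pairwise potentials \psi_{ij} encourage consistent labels between neighboring sites, and Z(x) is the partition function; the labels y range over the scene and object classes.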
