Accurate and robust semantic scene understanding for urban driving is challenging due to the diversity of object types, object motion, and ego-motion. Typical approaches fuse multiple sensors, such as cameras, IMUs, LiDAR, and radar, to estimate the state of surrounding objects, including their distance, direction, position, and velocity. However, such multi-sensor setups are complex and costly. This paper proposes a framework for object identification (FOI) from a moving camera in complex driving environments that uses camera data alone. The framework detects objects and extracts their behavioral features, namely motion, position, velocity, and distance. This information, referred to as object-wise semantic information, is fused to obtain a richer understanding of the driving scene. The work addresses ego-motion compensation and the extraction of accurate motion information for moving objects from a moving camera using image registration and optical flow estimation. A moving object detection model is designed within the framework by integrating an encoder–decoder network with a semantic segmentation network. The model jointly performs two tasks: semantic segmentation of objects into specific classes and binary pixel-wise classification that predicts, from temporal information, whether each detected object is moving or static. The work also contributes a novel dataset for moving object detection that covers all types of dynamic objects. FOI is evaluated on different sequences of the KITTI, EU-life long, and proposed datasets. The experimental results show that the proposed framework provides accurate object-wise semantic information.
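The abstract does not include implementation details, so the following is a minimal illustrative sketch, not the authors' exact pipeline, of camera-only ego-motion compensation followed by residual motion estimation: a homography between consecutive frames is estimated from matched features, the previous frame is warped to cancel camera motion, and dense optical flow on the aligned pair highlights independently moving objects. The specific components (OpenCV's ORB features, RANSAC homography, Farneback flow) are assumptions chosen for illustration.

```python
# Illustrative sketch only: ego-motion compensation via image registration,
# then dense optical flow on the aligned frames. NOT the paper's exact method;
# the OpenCV components below are stand-ins chosen for illustration.
import cv2
import numpy as np

def compensate_ego_motion(prev_gray, curr_gray):
    """Warp prev_gray onto curr_gray using a RANSAC-estimated homography."""
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:500]
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    h, w = curr_gray.shape[:2]
    return cv2.warpPerspective(prev_gray, H, (w, h))

def residual_motion(prev_gray, curr_gray):
    """Dense flow after ego-motion compensation; large magnitudes indicate
    independently moving objects rather than camera-induced motion."""
    aligned_prev = compensate_ego_motion(prev_gray, curr_gray)
    flow = cv2.calcOpticalFlowFarneback(aligned_prev, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    return flow, magnitude
```

Thresholding the residual flow magnitude yields a coarse motion cue that a learned model, such as the one described in the abstract, can refine into per-object moving/static decisions.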
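To make the dual-task design concrete, the toy sketch below shows an encoder–decoder that consumes temporally stacked frames and emits two pixel-wise outputs: multi-class semantic logits and a binary moving/static map. The layer sizes, input stacking, and class count are assumptions for illustration and do not reflect the paper's actual architecture.

```python
# Toy dual-head encoder-decoder, for illustration only (not the paper's model):
# one head predicts per-pixel semantic classes, the other a moving/static logit
# from temporally stacked input frames.
import torch
import torch.nn as nn

class DualTaskSegNet(nn.Module):
    def __init__(self, in_channels=6, num_classes=19):  # assumed: 2 stacked RGB frames
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.semantic_head = nn.Conv2d(32, num_classes, 1)  # per-pixel class logits
        self.motion_head = nn.Conv2d(32, 1, 1)              # per-pixel moving/static logit

    def forward(self, x):
        features = self.decoder(self.encoder(x))
        return self.semantic_head(features), self.motion_head(features)

# Joint training would combine a multi-class segmentation loss on the semantic
# head with a binary loss on the motion head.
model = DualTaskSegNet()
frames = torch.randn(1, 6, 128, 256)       # two RGB frames stacked along channels
sem_logits, motion_logits = model(frames)  # shapes: (1, 19, 128, 256), (1, 1, 128, 256)
```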