Abstract

Environmental perception is a key task for autonomous vehicles, enabling intelligent planning and safe decision-making. Most current state-of-the-art perception methods for vehicles, and in particular for 3D object detection, operate on a single frame of reference. As a result, they do not effectively utilise the temporal information associated with the objects or the scene in the input data sequences. The work presented in this paper corroborates the use of spatial and temporal information through multi-frame lidar point cloud data, leveraging spatio-temporal context to improve the accuracy of 3D object detection. The study also gathers insights into the effect of inducing temporal information into a network and its impact on the overall performance of the deep learning model. We take the Frustum-ConvNet architecture as the baseline model and propose methods to incorporate spatio-temporal information using convolutional LSTMs for 3D object detection from lidar data. We also propose an attention mechanism with temporal encoding to encourage the model to focus on salient feature points within the region proposals. The results of this study show that the inclusion of temporal information considerably improves the true-positive metrics: the orientation error of the 3D bounding box drops from 0.819 to 0.784 and from 0.294 to 0.111 for the car and pedestrian classes, respectively, on a customized subset of the nuScenes training dataset, and the overall nuScenes detection score (NDS) improves from 0.822 to 0.837 over the baseline.
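The abstract does not include implementation details, so the sketch below is only a minimal, illustrative convolutional-LSTM cell in PyTorch, showing how per-frame feature maps could be fused across lidar sweeps before a detection head. The class name `ConvLSTMCell`, the tensor shapes, and the channel sizes are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """LSTM cell whose gates are computed with 2D convolutions, so the
    hidden state keeps the spatial layout of the input feature map."""

    def __init__(self, in_channels: int, hidden_channels: int, kernel_size: int = 3):
        super().__init__()
        # A single convolution produces all four gates (input, forget, cell, output).
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels,
                               kernel_size,
                               padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)  # update cell memory
        h = torch.sigmoid(o) * torch.tanh(c)                         # new hidden state
        return h, c

# Fuse per-sweep feature maps (B, C, H, W) across T lidar sweeps.
T, B, C, H, W = 4, 2, 128, 16, 16
frames = torch.randn(T, B, C, H, W)  # placeholder per-frame features
cell = ConvLSTMCell(in_channels=C, hidden_channels=C)
h = torch.zeros(B, C, H, W)
c = torch.zeros(B, C, H, W)
for t in range(T):
    h, c = cell(frames[t], (h, c))
fused = h  # spatio-temporally fused features for the detection head
```

Recurrently folding the sweeps into a single spatial hidden state is what lets a single-frame detector such as Frustum-ConvNet consume temporal context without changing its downstream head.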
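The attention mechanism with temporal encoding is likewise not specified in the abstract; one plausible reading is scaled dot-product self-attention over per-point features within a region proposal, with a learned sweep-index embedding added so attention can weight points by both appearance and recency. Everything in this sketch (`TemporalPointAttention`, the shapes, and the use of `nn.Embedding` as the temporal encoding) is an assumption, not the paper's method.

```python
import math
import torch
import torch.nn as nn

class TemporalPointAttention(nn.Module):
    """Self-attention over per-point features inside a proposal; a learned
    embedding of each point's sweep (time) index is added before attention,
    so salient points can be weighted by appearance and recency together."""

    def __init__(self, dim: int, num_sweeps: int):
        super().__init__()
        self.time_embed = nn.Embedding(num_sweeps, dim)  # assumed temporal encoding
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, feats, sweep_idx):
        # feats: (B, N, dim) point features; sweep_idx: (B, N) integer sweep index
        x = feats + self.time_embed(sweep_idx)
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)
        return attn @ v  # re-weighted point features for the detection head

# Usage: 64 points per proposal, 128-dim features, 4 lidar sweeps.
attn = TemporalPointAttention(dim=128, num_sweeps=4)
feats = torch.randn(2, 64, 128)
sweep_idx = torch.randint(0, 4, (2, 64))
out = attn(feats, sweep_idx)  # (2, 64, 128) attended point features
```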
