Abstract

Vehicle detection, the process of identifying vehicles as axis-aligned bounding boxes in still images, is widely used by autonomous vehicles (AVs) to estimate the range, time-to-collision, and motion of surrounding vehicles. Bounding boxes, while convenient, are too coarse to adapt well to variations in vehicle shape and pose. In this work, we present TBox (Trapezoid & Box), a novel fine-grained representation, useful for both localization and recognition, that extends the bounding box by restricting the spatial extent of a vehicle to a set of keypoints and by indicating semantically significant local areas with subclasses. In contrast to previous monolithic models, we propose a cascaded anchor-free architecture that estimates both the bounding box and the TBox. One subnetwork uses a stacked hourglass network to detect each vehicle as a pair of corners without anchors; specifically, it learns corner affinity fields, enabling robust corner grouping. The other subnetwork estimates the TBox as a set of keypoints; it uses the bounding box results to avoid ambiguous keypoint associations and reuses existing features to reduce the number of parameters. We also propose a multitask learning strategy for training the cascaded model that implicitly integrates global context with local details, improving both tasks. During testing, a refinement algorithm explicitly uses robust local keypoints to correct possible global box errors, ensuring tight geometric representations for nearby critical vehicles. Experiments show that our method outperforms existing anchor-free detectors on vehicle detection and achieves better performance on the TBox task while using a small model.
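
To make the cascaded design concrete, the following is a minimal, hypothetical PyTorch-style sketch of the two subnetworks described above: a corner subnetwork that predicts corner heatmaps plus affinity fields for anchor-free grouping, and a TBox subnetwork that reuses the shared backbone features, pooled inside each detected box, to regress keypoints. All module names, channel sizes, keypoint counts, and the RoI-pooling choice are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the cascaded two-subnetwork idea (not the paper's code).
import torch
import torch.nn as nn
from torchvision.ops import roi_align


class CornerSubnet(nn.Module):
    """Anchor-free corner detector: corner heatmaps plus affinity fields for grouping."""
    def __init__(self, feat_ch=256, num_classes=1):
        super().__init__()
        self.heatmaps = nn.Conv2d(feat_ch, 2 * num_classes, 1)  # top-left / bottom-right corners
        self.affinity = nn.Conv2d(feat_ch, 2, 1)                # 2-D field used to pair corners

    def forward(self, feats):
        return self.heatmaps(feats).sigmoid(), self.affinity(feats)


class TBoxSubnet(nn.Module):
    """Regresses TBox keypoints from backbone features pooled inside each detected box."""
    def __init__(self, feat_ch=256, num_keypoints=8, pool=7):
        super().__init__()
        self.pool = pool
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(feat_ch * pool * pool, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 2 * num_keypoints),  # (x, y) per keypoint, relative to the box
        )

    def forward(self, feats, boxes, stride=4):
        # boxes: list of [N_i, 4] tensors in image coordinates, one per image in the batch.
        rois = roi_align(feats, boxes, output_size=self.pool, spatial_scale=1.0 / stride)
        return self.head(rois)


class CascadedDetector(nn.Module):
    """Shared backbone (e.g. a stacked hourglass) feeding both subnetworks."""
    def __init__(self, backbone, feat_ch=256):
        super().__init__()
        self.backbone = backbone
        self.corner_subnet = CornerSubnet(feat_ch)
        self.tbox_subnet = TBoxSubnet(feat_ch)

    def forward(self, images, boxes):
        feats = self.backbone(images)
        heatmaps, affinity = self.corner_subnet(feats)   # task 1: boxes from grouped corners
        keypoints = self.tbox_subnet(feats, boxes)       # task 2: TBox keypoints per box
        return heatmaps, affinity, keypoints
```

Under the multitask strategy described above, the two subnetworks would plausibly be trained jointly, for example by summing a corner/affinity loss with a keypoint regression loss, so that box-level context and keypoint-level detail both shape the shared backbone features.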
