Abstract
Semi-supervised learning (SSL) has promising potential for improving model performance using both labelled and unlabelled data. Since recovering 3D information from 2D images is an ill-posed problem, current state-of-the-art methods for monocular 3D object detection (Mono3D) have relatively low precision and recall, which makes semi-supervised learning for Mono3D tasks challenging and understudied. In this work, we propose a unified and effective semi-supervised learning framework called Mix-Teaching that can be applied to most monocular 3D object detectors. Based on the idea of decomposition and recombination, unlabelled samples are first decomposed into collections of image patches with high-quality predictions and collections of background images containing no objects. The recombination operation then merges these into mixed images containing dense instances with high-quality pseudo-labels, on which the student model is trained. In addition, we propose an uncertainty-based filter to distinguish high-quality pseudo-labels from noisy predictions during the decomposition process. On the KITTI and nuScenes benchmarks, Mix-Teaching consistently improves MonoFlex and GUPNet by significant margins under various labelling ratios. Our method achieves an improvement of around 6.34% $AP_{3D}$ over GUPNet on the validation set when using only 10% labelled data. Using the full training set and the additional 38K raw images from KITTI, it further improves MonoFlex by 4.65% absolute $AP_{3D}$ for car detection, reaching 18.54% $AP_{3D}$, which ranks first among all monocular-based methods on the KITTI test leaderboard.
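The decomposition and recombination idea described above can be illustrated with a small sketch. This is not the authors' implementation: the `Prediction` fields, the threshold values, the helper names, and the choice to paste each patch back at its original pixel coordinates are all assumptions made for illustration, and images are represented as plain nested lists to keep the example dependency-free.

```python
from dataclasses import dataclass
from typing import List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values


@dataclass
class Prediction:
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels, x2/y2 exclusive
    score: float                    # classification confidence in [0, 1]
    uncertainty: float              # hypothetical per-box uncertainty, lower is better


def filter_pseudo_labels(preds: List[Prediction],
                         score_thr: float = 0.7,
                         unc_thr: float = 0.2) -> List[Prediction]:
    """Keep predictions that are both confident and low-uncertainty.

    The two thresholds are illustrative values, not taken from the paper.
    """
    return [p for p in preds
            if p.score >= score_thr and p.uncertainty <= unc_thr]


def crop(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Decomposition step: cut the patch covered by `box` out of `image`."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def paste(canvas: Image, patch: Image, box: Tuple[int, int, int, int]) -> Image:
    """Return a copy of `canvas` with `patch` written at the top-left of `box`."""
    x1, y1, _, _ = box
    out = [row[:] for row in canvas]
    for dy, patch_row in enumerate(patch):
        out[y1 + dy][x1:x1 + len(patch_row)] = patch_row
    return out


def mix(background: Image,
        sources: List[Tuple[Image, List[Prediction]]]):
    """Recombination step: paste filtered patches onto an empty background.

    Each patch is pasted at its original coordinates (an assumption of this
    sketch), and the kept predictions become the pseudo-labels of the result.
    """
    mixed = [row[:] for row in background]
    labels: List[Prediction] = []
    for image, preds in sources:
        for p in preds:
            mixed = paste(mixed, crop(image, p.box), p.box)
            labels.append(p)
    return mixed, labels
```

In this sketch, a teacher model's raw predictions would first pass through `filter_pseudo_labels`, and the surviving boxes from several unlabelled images would then be combined by `mix` into one densely populated training image for the student.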