Human detection still suffers from occlusion, complex backgrounds, and scale variation. Projecting three-dimensional (3D) points onto the ground to generate an orthographic top-view (OTV) image for detection can effectively alleviate these problems. However, depth sensors may be placed arbitrarily, making it difficult to create OTV images from the dense point cloud converted from a depth image. We focus on the generation of OTV images and on human detection via the constructed OTV image. First, we propose a ground plane extraction method that is well suited to various camera positions and orientations in complex scenes. Next, points are converted to a uniform coordinate system using the ground parameters and encoded to generate a three-channel OTV image. Then, a mainstream two-dimensional (2D) detection network is employed to detect humans directly on OTV images, and the 3D bounding box is further obtained by computing the mapping from the OTV image. In addition, we propose a semiautomated annotation method to address the scarcity of OTV image annotations. The proposed method is evaluated on the EPFL dataset, including two subsets, and achieves state-of-the-art performance compared with existing approaches. Moreover, our 2D and 3D human detection method runs at more than 26 FPS on the CPU.
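The core projection step described above (transforming points into ground-plane coordinates and rasterizing them into a top-view image) can be sketched as follows. This is a minimal illustration, assuming a ground plane already estimated as n·x + d = 0 with unit normal n; the function name, grid parameters, and max-height encoding are illustrative assumptions, not the paper's exact three-channel scheme.

```python
import numpy as np

def ground_to_otv(points, normal, d, resolution=0.05, grid=(200, 200), max_height=2.5):
    """Project camera-frame 3D points onto an estimated ground plane and
    rasterize their heights into a single top-view image channel (sketch).

    points     : (N, 3) array of 3D points in the sensor frame
    normal, d  : ground plane n.x + d = 0, n assumed (near) unit length
    resolution : ground-plane cell size in meters per pixel
    """
    points = np.asarray(points, dtype=float)
    normal = np.asarray(normal, dtype=float)
    normal /= np.linalg.norm(normal)

    # Signed height of each point above the ground plane.
    heights = points @ normal + d

    # Build an orthonormal basis (u, v) spanning the ground plane.
    helper = np.array([1.0, 0.0, 0.0])
    if abs(normal @ helper) > 0.9:          # avoid a near-parallel helper axis
        helper = np.array([0.0, 1.0, 0.0])
    u = np.cross(normal, helper)
    u /= np.linalg.norm(u)
    v = np.cross(normal, u)

    # Planar (ground) coordinates of each point.
    uv = np.stack([points @ u, points @ v], axis=1)

    # Rasterize: keep the maximum clipped height per cell
    # (one plausible single-channel encoding).
    h, w = grid
    img = np.zeros((h, w), dtype=float)
    cols = np.floor(uv[:, 0] / resolution).astype(int) + w // 2
    rows = np.floor(uv[:, 1] / resolution).astype(int) + h // 2
    ok = (rows >= 0) & (rows < h) & (cols >= 0) & (cols < w)
    np.maximum.at(img, (rows[ok], cols[ok]), np.clip(heights[ok], 0.0, max_height))
    return img
```

A detected box on such an image maps straight back to 3D: pixel offsets from the grid center scale by `resolution` along `u` and `v`, and the cell's stored height gives the vertical extent, which is how an OTV detection can yield a 3D bounding box.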