Abstract

Depth estimation is a prerequisite for building the 3D perception capability of the artificial intelligence of things (AIoT). Real-time inference with extremely low computing resource consumption is critical on edge devices. However, most single-view depth estimation networks focus on improving accuracy when running on high-end GPUs, which runs counter to the real-time requirement of edge devices. To address this issue, this article proposes a novel encoder-decoder network that realizes real-time monocular depth estimation on edge devices. The proposed network merges semantic information over a global receptive field via an efficient transformer-based module, providing richer object detail for depth assignment. The transformer-based module is integrated at the lowest-resolution level of the encoder-decoder architecture, which greatly reduces the number of Vision Transformer (ViT) parameters. In particular, we propose a novel patch convolutional layer for low-latency feature extraction in the encoder and an SConv5 layer for effective depth assignment in the decoder. The proposed network achieves an outstanding balance between accuracy and speed on the NYU Depth v2 dataset: a low RMSE of 0.554 and a fast speed of 58.98 FPS on an NVIDIA Jetson Nano with TensorRT optimization, outperforming most state-of-the-art real-time results.
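
For illustration, the following is a minimal PyTorch sketch of the layout the abstract describes: a convolutional encoder, a lightweight transformer block applied only at the lowest-resolution feature map (so the ViT-style module stays small), and an upsampling decoder that regresses a single-channel depth map. The class names, channel widths, the use of strided convolutions in place of the paper's patch convolutional layer, and the depthwise 5x5 convolution standing in for SConv5 are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the described encoder-decoder layout.
# Names, widths, and layer choices are illustrative assumptions,
# not the network proposed in the article.
import torch
import torch.nn as nn

class BottleneckTransformer(nn.Module):
    """Self-attention on the smallest feature map: few tokens keep the
    ViT-style module cheap in parameters and FLOPs."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C)
        normed = self.norm(tokens)
        tokens = tokens + self.attn(normed, normed, normed)[0]
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class TinyDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: strided convolutions stand in for the paper's
        # patch convolutional feature extractor (assumed).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Global context is modeled only at the bottleneck (lowest resolution).
        self.bottleneck = BottleneckTransformer(128)
        # Decoder: bilinear upsampling plus a depthwise 5x5 convolution as a
        # stand-in for the paper's SConv5 layer (assumed).
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(128, 128, 5, padding=2, groups=128),
            nn.Conv2d(128, 64, 1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 64, 5, padding=2, groups=64),
            nn.Conv2d(64, 32, 1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(32, 1, 3, padding=1),    # single-channel depth map
        )

    def forward(self, x):
        return self.decoder(self.bottleneck(self.encoder(x)))

# Example: a 480x640 RGB frame (the NYU Depth v2 resolution) yields a
# 480x640 depth prediction.
depth = TinyDepthNet()(torch.randn(1, 3, 480, 640))
print(depth.shape)  # torch.Size([1, 1, 480, 640])
```

A model of this form can be exported (e.g. via ONNX) and optimized with TensorRT for deployment on a Jetson Nano, which is the setting the reported 58.98 FPS refers to.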
