Abstract
Depth estimation is a prerequisite for building the 3D perception capability of the artificial intelligence of things (AIoT). Real-time inference with extremely low computing resource consumption is critical on edge devices. However, most single-view depth estimation networks focus on improving accuracy when running on high-end GPUs, which conflicts with the real-time requirements of edge devices. To address this issue, this article proposes a novel encoder-decoder network that realizes real-time monocular depth estimation on edge devices. The proposed network merges semantic information over a global receptive field via an efficient transformer-based module, providing more object detail for depth assignment. The transformer-based module is integrated at the lowest-resolution level of the encoder-decoder architecture, which greatly reduces the parameter count of the Vision Transformer (ViT). In particular, we propose a novel patch convolutional layer for low-latency feature extraction in the encoder and an SConv5 layer for effective depth assignment in the decoder. The proposed network strikes an outstanding balance between accuracy and speed on the NYU Depth v2 dataset: it achieves a low RMSE of 0.554 and a fast speed of 58.98 FPS on an NVIDIA Jetson Nano with TensorRT optimization, outperforming most state-of-the-art real-time results.
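To make the described architecture concrete, the following is a minimal sketch (not the authors' code) of an encoder-decoder depth network with a lightweight transformer block applied only at the lowest-resolution feature map, as the abstract describes. The layer names, channel widths, and the `PatchConv`/`SConv5` definitions below are illustrative assumptions rather than the paper's actual designs.

```python
# Hypothetical sketch of the architecture described in the abstract:
# strided "patch" convolutions in the encoder, a transformer block at the
# lowest resolution, and separable 5x5 convolutions in the decoder.
import torch
import torch.nn as nn


class PatchConv(nn.Module):
    """Assumed stand-in for the paper's patch convolutional layer:
    a strided convolution that downsamples while extracting features."""
    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.conv(x))


class SConv5(nn.Module):
    """Assumed stand-in for the SConv5 decoder layer: a depthwise-separable
    5x5 convolution used while upsampling toward the depth map."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=5, padding=2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))


class TinyDepthNet(nn.Module):
    """Encoder-decoder with a transformer bottleneck at 1/8 resolution."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            PatchConv(3, 32), PatchConv(32, 64), PatchConv(64, embed_dim)
        )
        # The transformer runs only on the smallest feature map, so the token
        # count (and hence parameter/latency cost) stays low.
        self.bottleneck = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=4, dim_feedforward=256, batch_first=True
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            SConv5(embed_dim, 64),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            SConv5(64, 32),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            SConv5(32, 1),
        )

    def forward(self, x):
        f = self.encoder(x)                    # (B, C, H/8, W/8)
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)  # (B, H*W/64, C) token sequence
        tokens = self.bottleneck(tokens)       # global self-attention
        f = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.decoder(f)                 # (B, 1, H, W) depth map


if __name__ == "__main__":
    depth = TinyDepthNet()(torch.randn(1, 3, 224, 224))
    print(depth.shape)  # torch.Size([1, 1, 224, 224])
```

Placing the attention block at the bottleneck is what keeps the self-attention cost quadratic only in the number of low-resolution tokens, which is consistent with the abstract's claim of reducing ViT parameters for edge deployment.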