Abstract
To improve vision-aided state estimation on size, weight, and power (SWaP)-constrained robotic platforms, we describe our unsupervised, deep convolutional-deconvolutional sensor fusion network, Multi-Hypothesis DeepEfference (MHDE). MHDE learns to intelligently combine noisy, heterogeneous sensor data to predict several probable hypotheses for the dense, pixel-level correspondence between a source image and an unseen target image. We show how our multi-hypothesis formulation provides increased robustness against dynamic, heteroscedastic sensor and motion noise, computing hypothesis image mappings and predictions at 76–357 Hz depending on the number of hypotheses generated. MHDE fuses noisy, heterogeneous sensory inputs through two parallel, inter-connected architectural pathways and n (1–20 in this work) multi-hypothesis-generating sub-pathways that produce n global correspondence estimates between a source and a target image. We evaluated MHDE on the KITTI Odometry dataset, benchmarked it against the vision-only DeepMatching and Deformable Spatial Pyramids algorithms, and demonstrated a significant runtime decrease and a performance increase relative to the next-best-performing method.
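As a rough illustration of the multi-hypothesis idea (a minimal sketch, not the authors' released code), the fragment below assumes a PyTorch-style setup in which each of n hypothesis sub-pathways emits a dense pixel-offset (correspondence) field; the source image is warped by each field and only the photometrically best hypothesis per sample contributes to the unsupervised loss. The names warp and winner_take_all_loss, and the normalized-flow convention, are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def warp(source, flow):
    """Warp a source image toward the target using a dense
    pixel-offset field (normalized units) via bilinear sampling."""
    b, _, h, w = source.shape
    # Build a normalized sampling grid in [-1, 1] for grid_sample.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).expand(b, h, w, 2).to(source)
    # flow is (b, 2, h, w); add it to the base grid per pixel.
    grid = base + flow.permute(0, 2, 3, 1)
    return F.grid_sample(source, grid, align_corners=True)

def winner_take_all_loss(source, target, flows):
    """Photometric winner-take-all over n hypothesis flows:
    only the best hypothesis per sample receives gradient."""
    errors = torch.stack(
        [F.l1_loss(warp(source, f), target, reduction="none")
             .mean(dim=(1, 2, 3))
         for f in flows], dim=1)        # shape (b, n_hyp)
    best, _ = errors.min(dim=1)          # pick the winning hypothesis
    return best.mean()
```

Under this reading, penalizing only the minimum-error hypothesis lets different sub-pathways specialize on different noise or motion regimes, which is one plausible mechanism behind the robustness claim above.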
Highlights
Owing both to limitations in the speed and quality of their sensors and to restrictive on-board computational capabilities, current state-of-the-art (SOA) size, weight, and power (SWaP)-constrained autonomous robotic systems must trade operational speed for safety and robustness.
While increased performance under noise-free conditions was an unintended benefit of the multi-hypothesis formulation, the central contribution of this work is the handling of noise-contaminated input data.
We demonstrated unsupervised learning of correspondence between static grayscale images in a deep sensorimotor fusion network operating on noisy sensor data.
Summary
Owing both to limitations in the speed and quality of their sensors and to restrictive on-board computational capabilities, current state-of-the-art (SOA) size, weight, and power (SWaP)-constrained autonomous robotic systems must trade operational speed for safety and robustness. This trade-off is especially pronounced in dynamic, GPS- and communications-denied environments where robust navigation must be performed using only on-board sensors and computational resources. In visual odometry (VO), a crucial early step is finding a correspondence mapping between scene elements perceived on the imaging plane in sequentially captured image frames; this is referred to as the correspondence problem (see [2] for a review).
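For context, the sketch below shows a conventional sparse solution to the correspondence problem using OpenCV's ORB features and brute-force descriptor matching, the kind of front end a classical VO pipeline would feed into pose estimation. This is a baseline illustration, not MHDE's dense, learned approach; the function name sparse_correspondences is an assumption for this example.

```python
import cv2

def sparse_correspondences(frame_a, frame_b, max_matches=200):
    """Classical sparse correspondence between two sequential
    grayscale frames: detect ORB keypoints, match descriptors,
    and return corresponding pixel coordinate pairs."""
    orb = cv2.ORB_create()
    kp_a, des_a = orb.detectAndCompute(frame_a, None)
    kp_b, des_b = orb.detectAndCompute(frame_b, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_a, des_b), key=lambda m: m.distance)
    # Keep the strongest matches as (x, y) -> (x, y) pixel pairs.
    return [(kp_a[m.queryIdx].pt, kp_b[m.trainIdx].pt)
            for m in matches[:max_matches]]
```

Such sparse matches cover only detected keypoints; MHDE instead predicts a dense, pixel-level correspondence field and hedges against noise by emitting multiple hypotheses.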