On the robustness of vision transformers for in-flight monocular depth estimation

Simone Ercolino, Alessio Devoto, Silvio Mazzaro, Matteo Santini, Simone Scardapane, Luca Monorchio

doi:10.1007/s44244-023-00005-3

Open Access

Abstract

Monocular depth estimation (MDE) has recently shown impressive performance, even in zero-shot or few-shot scenarios. In this paper, we consider the use of MDE on board drones during low-altitude flights, which is required in a number of safety-critical and monitoring operations. In particular, we evaluate a state-of-the-art vision transformer (ViT) variant, pre-trained on a massive MDE dataset. We test it both in a zero-shot scenario and after fine-tuning on a dataset of flight records, and compare its performance to that of a classical fully convolutional network. In addition, we evaluate for the first time whether these models are susceptible to adversarial attacks, by optimizing a small adversarial patch that generalizes across scenarios. We investigate several variants of losses for this task, including weighted error losses that let us design the patch to selectively degrade the model's performance on a desired depth range. Overall, our results highlight that (a) ViTs can outperform convolutional models in this context after proper fine-tuning, and (b) they appear to be more robust to adversarial attacks designed in the form of patches, which is a crucial property for this family of tasks.
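
The idea of a weighted error loss that targets a desired depth range can be illustrated with a minimal sketch. This is not the loss used in the paper: the function name `range_weighted_error`, the weights `w_in` and `w_out`, the use of absolute error, and the toy depth values are all assumptions made for illustration. An attack would optimize the patch pixels to increase such a quantity, which requires an automatic differentiation framework and is omitted here.

```python
# Illustrative sketch only: names, weights and depth values are assumptions,
# not taken from the paper.

def range_weighted_error(pred, target, d_min, d_max, w_in=1.0, w_out=0.0):
    """Weighted mean absolute depth error.

    Pixels whose ground-truth depth lies in [d_min, d_max] get weight w_in,
    all other pixels get weight w_out.
    """
    total, weight_sum = 0.0, 0.0
    for p, t in zip(pred, target):
        w = w_in if d_min <= t <= d_max else w_out
        total += w * abs(p - t)
        weight_sum += w
    # Avoid division by zero when no pixel carries any weight.
    return total / weight_sum if weight_sum > 0 else 0.0

# Flattened toy depth maps in metres (hypothetical values).
target = [2.0, 5.0, 12.0, 30.0]
pred = [2.5, 4.0, 12.0, 20.0]

# Error restricted to pixels whose true depth lies in the 1-10 m band.
near_error = range_weighted_error(pred, target, d_min=1.0, d_max=10.0)

# Error over all pixels, for comparison.
full_error = range_weighted_error(pred, target, d_min=0.0, d_max=100.0)
```

With `w_out=0.0` only the selected band contributes, so a patch optimized against this objective would concentrate its effect on that depth range. A nonzero `w_out` would trade selectivity for a broader effect.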

Full Text