Abstract

Unmanned aerial vehicles (UAVs) are becoming widespread with applications ranging from film-making and journalism to rescue operations and surveillance. Research communities (speech processing, computer vision, control) are starting to explore the limits of UAVs, but their efforts remain somewhat isolated. In this paper we unify multiple modalities (speech, vision, language) into a speech interface for UAV control. Our goal is to perform unconstrained speech recognition while leveraging the visual context. To this end, we introduce a multimodal evaluation dataset, consisting of spoken commands and associated images, which represent the visual context of what the UAV “sees” when the pilot utters the command. We provide baseline results and address two main research directions. First, we investigate the robustness of the system by (i) training it with a partial list of commands, and (ii) corrupting the recordings with outdoor noise. We perform a controlled set of experiments by varying the size of the training data and the signal-to-noise ratio. Second, we look at how to incorporate visual information into our model. We show that we can incorporate visual cues in the pipeline through the language model, which we implemented using a recurrent neural network. Moreover, by using gradient activation maps the system can provide visual feedback to the pilot regarding the UAV’s understanding of the command. Our conclusions are that multimodal speech recognition can be successfully used in this scenario and that visual information helps especially when the noise level is high. The dataset and our code are available at http://kite.speed.pub.ro.
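To make the idea of incorporating visual cues through the language model more concrete, the sketch below shows one plausible way (a minimal illustration, not the paper's actual implementation) to condition a recurrent language model on an image feature vector, so that the probability of the next word in a spoken command depends on what the UAV sees. The class name, dimensions, and choice of a GRU are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' code) of a visually conditioned
# RNN language model: an image feature vector is concatenated to each word
# embedding so that P(w_t | w_<t, image) depends on the visual context.
import torch
import torch.nn as nn

class VisualLM(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=128, img_dim=512, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim + img_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens, img_feat):
        # tokens: (batch, seq_len) word ids; img_feat: (batch, img_dim)
        emb = self.embed(tokens)                                   # (B, T, E)
        img = img_feat.unsqueeze(1).expand(-1, emb.size(1), -1)    # (B, T, I)
        h, _ = self.rnn(torch.cat([emb, img], dim=-1))             # (B, T, H)
        return self.out(h)                                         # next-word logits

# Example: score a batch of 4 commands of length 6 against random image features.
logits = VisualLM()(torch.randint(0, 1000, (4, 6)), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 6, 1000])
```

A model of this kind could, for instance, rescore speech-recognition hypotheses using the visual context, which corresponds to the integration point (the language model) described in the abstract.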
