Abstract

As voice interfaces to devices and digital assistants have increased in popularity, so too have the challenging environments in which they are expected to perform. In this talk, we will present an overview of the signal processing and speech recognition AI modeling techniques we have developed at Facebook to enable robust voice interaction on Portal video calling devices and Oculus VR headsets. We will also describe progress in captioning and understanding the wide variety of video content shared on Facebook apps, where the acoustic conditions are diverse and challenging and the audio is typically captured on commodity mobile phones. While such systems have historically been developed to run on powerful servers in the cloud, there is increasing interest in speech models that can run locally on the client device. We will describe the challenges of on-device processing and our recent progress in creating efficient, low-footprint speech models. Finally, we will present the challenges and future directions we are exploring to enable rich voice interactions on the next generation of computing devices, including augmented reality glasses.

Full Text

Paper version not known
