Abstract

Voice user interfaces and digital assistants are rapidly entering our lives and becoming singular touch points spanning our devices. These always-on services capture and transmit our audio data to powerful cloud services for further processing and subsequent actions. Our voices and raw audio signals collected through these devices contain a host of sensitive paralinguistic information that is transmitted to service providers regardless of deliberate or false triggers. As our emotional patterns and sensitive attributes like our identity, gender, and well-being are easily inferred using deep acoustic models, we encounter a new generation of privacy risks by using these services. One approach to mitigate the risk of paralinguistic-based privacy breaches is to exploit a combination of cloud-based processing with privacy-preserving, on-device paralinguistic information learning and filtering before transmitting voice data. In this article we introduce EDGY , a configurable, lightweight, disentangled representation learning framework that transforms and filters high-dimensional voice data to identify and contain sensitive attributes at the edge prior to offloading to the cloud. We evaluate EDGY’s on-device performance and explore optimization techniques, including model quantization and knowledge distillation, to enable private, accurate, and efficient representation learning on resource-constrained devices. Our results show that EDGY runs in tens of milliseconds with 0.2% relative improvement in “zero-shot” ABX score or minimal performance penalties of approximately 5.95% word error rate (WER) in learning linguistic representations from raw voice signals, using a CPU and a single-core ARM processor without specialized hardware.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call