Abstract

Voice user interfaces and assistants are rapidly entering our lives and becoming singular touchpoints spanning our devices. The raw audio signals collected through these devices contain a host of sensitive paralinguistic information (e.g., emotional patterns) that is transmitted to service providers whether the trigger was deliberate or false. Using these services therefore exposes us to a new generation of privacy risks. To tackle this issue, we have developed EDGY, a configurable, lightweight, disentangled representation learning framework that transforms high-dimensional voice data to identify and selectively filter sensitive attributes at the edge before offloading to the cloud. Our results show that EDGY runs in tens of milliseconds on a CPU and a single-core ARM processor with no specialized hardware, with a 0.2% relative improvement in ABX score and minimal performance penalties in learning linguistic representations from raw signals.

