Attackers can generate adversarial examples (AEs) that stealthily mislead automatic speech recognition (ASR) models, raising significant concerns about the security of intelligent voice control (IVC) devices. Existing adversarial attacks mainly generate AEs that mislead ASR models into outputting specific target English commands (e.g., "open the door"). However, it remains unknown whether AEs can be used to issue commands in other languages to attack commercial black-box ASR models. In this paper, taking Chinese phrases (e.g., 支付宝付款) and "Chinese-English code-switching" phrases (e.g., 关闭GPS) as the target commands, we propose adversarial attacks against commercial multilingual ASR models. In particular, we call a multilingual ASR model that can recognize both Chinese and English a Chinese-English ASR model. In English, "支付宝付款" and "关闭GPS" mean "Alipay payment" and "turn off GPS", respectively. Specifically, we generate transferable AEs based on the open-source conventional DataTang Mandarin ASR model. Given 55 target commands, the success rate of generating effective AEs reaches up to 96% against the Aliyun ASR API and 80% against the Tencentyun ASR API. Our AEs can trigger real attack actions on voice assistants (e.g., Apple Siri, Xiaomi Xiaoaitongxue) or spread malicious messages through ASR API services, while the target commands embedded in the AEs remain inaudible to human beings. Finally, by analyzing the spectrum differences between benign audio clips and AEs, we propose a general defense against adversarial audio attacks.
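The abstract does not describe the generation procedure itself. As a rough illustration only, the sketch below shows one generic way such targeted audio AEs are commonly crafted against a differentiable, CTC-based ASR surrogate: an additive perturbation is optimized to minimize the CTC loss of the target transcript while being kept small. The `asr_log_probs` interface, the perturbation bound `eps`, and all hyperparameters are hypothetical placeholders; this is not the paper's actual method, its transferability technique, or the DataTang model interface.

```python
import torch

# Hypothetical placeholder: any differentiable surrogate ASR model that maps a
# waveform (values in [-1, 1]) to per-frame log-probabilities over its token
# vocabulary, shaped (time_frames, batch, vocab_size), as expected by CTCLoss.
def asr_log_probs(model, waveform):
    return model(waveform)

def craft_targeted_ae(model, benign_wave, target_tokens,
                      steps=1000, lr=1e-3, eps=0.02):
    """Generic gradient-based targeted attack sketch: minimize the CTC loss of
    the target transcript while bounding the perturbation (L_inf <= eps).
    target_tokens must not contain the CTC blank index (0 here)."""
    delta = torch.zeros_like(benign_wave, requires_grad=True)
    ctc = torch.nn.CTCLoss(blank=0)
    opt = torch.optim.Adam([delta], lr=lr)
    target = torch.tensor(target_tokens).unsqueeze(0)        # (1, target_len)
    target_len = torch.tensor([len(target_tokens)])
    for _ in range(steps):
        adv = (benign_wave + delta).clamp(-1.0, 1.0)          # valid audio range
        log_probs = asr_log_probs(model, adv)                 # (T, 1, vocab)
        input_len = torch.tensor([log_probs.size(0)])
        loss = ctc(log_probs, target, input_len, target_len)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                           # keep perturbation small
    return (benign_wave + delta).detach().clamp(-1.0, 1.0)
```

In practice, attacks on commercial black-box APIs additionally rely on transferability from the white-box surrogate, which this sketch does not show.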