Abstract

Research suggests that automatic speech recognition (ASR) systems, which automatically convert speech to text, perform differently across input classes (e.g., accent, age), calling for fairer AI systems that perform similarly across those classes. However, would an AI system that performs equally well regardless of input class really be perceived as fair? To answer this question, we investigate how listeners perceive identical ASR results differently depending on whether the speaker is a native speaker (NS) or a non-native speaker (NNS), which may lead to unfair situations. We conducted a study (n = 420) in which participants were given one of ten recordings of the same script, spoken with various accents, along with identical captions. We found that even with the same ASR output, listeners perceived the results differently: they found the captions more useful for NNSs' speech and blamed NNSs more than NSs for recognition errors. Based on these findings, we present design implications suggesting that building a fair ASR system requires going a step further than merely achieving equal performance across input classes.