Abstract

Smart speakers with voice assistants such as Google Home or Amazon Alexa are increasingly popular in our daily lives because of the convenience of issuing voice commands. Ensuring that these voice assistants serve different population subgroups equitably is crucial. In this paper, we present the first framework, AudioAcc, to help evaluate how well voice assistants perform across various accents. AudioAcc takes speech from YouTube videos and generates composite commands. We further propose a new metric, Consistency of Results (COR), which developers can use to detect inconsistent recognition results and rewrite their skills to improve Word Error Rate (WER) performance. We evaluate AudioAcc against complete sentences extracted from YouTube videos; the results show that the composite sentences generated by AudioAcc perform comparably to the complete sentences. Our evaluation across diverse audiences shows that, first, speech from native speakers, particularly Americans, achieves the best WER, outperforming speech from other native and nonnative speakers by 9.5%. Second, speech from American professional speakers is treated significantly more fairly and achieves the best WER, outperforming speech from German professional speakers and from German and American amateur speakers by 8.3%. Moreover, we show that the COR metric can help developers rewrite their skills to improve WER accuracy, which we use to improve accuracy for the Russian accent.
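
The WER figures cited above follow the standard definition of Word Error Rate: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The sketch below illustrates this conventional computation only; it is not code from the AudioAcc framework, and the example command is hypothetical.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table for word-level Levenshtein distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution ("lights" -> "light") out of four reference words -> WER = 0.25
print(word_error_rate("turn on the lights", "turn on the light"))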
