Abstract

This study examines the effect of data conditions on automated speech processing systems. The goal is to better understand how acoustic features affect accuracy and to develop more robust features. A speaker identification (SID) system was used for the experiments. To explore this issue, a new longitudinal database was collected from 60 speakers over 18 months. This corpus allowed us to examine four data factors that impact SID: (1) intersession variability, (2) question intonation, (3) text-dependency (identical phonetic content), and (4) whispered speech. First, we found that intersession SID suffered an average accuracy loss of 17%, independent of the time elapsed between sessions. Second, mismatched intonation conditions between training and testing hurt SID performance by 5%. Third, text-dependency had the most dramatic impact: using phonetically identical training and test sentences yielded 0% error, but replacing the target speaker's content with a random text-independent sentence, all else being equal, caused accuracy to plummet by 94%. In all of the erroneous identifications, the top-ranked speaker was the one speaking the sentence used for the training model. Finally, when whispered speech was used in training and normally phonated speech in testing, SID performance was severely degraded.
