Abstract

Speech technology is firmly rooted in daily life, most notably in command-and-control (C&C) applications. C&C usability degrades quickly, however, for people with non-standard speech. We pursue a fully adaptive vocal user interface (VUI) which can learn both vocabulary and grammar directly from interaction examples, achieving robustness against non-standard speech by building up models from scratch. This approach raises feasibility concerns about the amount of training material required to reach acceptable recognition accuracy. In previous work, we proposed a VUI based on non-negative matrix factorisation (NMF) to find the recurrent acoustic and semantic patterns that make up spoken commands and device-specific actions, and showed its effectiveness on unimpaired speech. In this work, we evaluate the feasibility of a self-taught VUI on a new database called DOMOTICA-3, which contains dysarthric speech with typical commands in a home automation setting. Additionally, we compare our NMF-based system with a system based on Gaussian mixtures. The evaluation favours our NMF-based approach, which yields feasible recognition accuracies for people with dysarthric speech after only a few learning examples. Finally, we propose the use of a multi-layered semantic frame structure and demonstrate its effectiveness in boosting overall performance.
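
The NMF learning scheme mentioned above can be illustrated with a small sketch: an acoustic representation of each utterance (e.g. a histogram of acoustic co-occurrences) is stacked with a binary semantic label vector and the joint data matrix is factorised, so that every learned dictionary component couples a recurring acoustic pattern to the slot values it expresses. The dimensions, feature choice and scikit-learn configuration below are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.decomposition import NMF

# Illustrative dimensions (assumptions, not the paper's configuration).
n_utt, n_acoustic, n_semantic, n_patterns = 200, 1000, 25, 30

rng = np.random.default_rng(0)
# Acoustic part: non-negative co-occurrence counts per utterance.
V_ac = rng.random((n_utt, n_acoustic))
# Semantic part: binary indicators of the slot values in each command.
V_sem = (rng.random((n_utt, n_semantic)) > 0.8).astype(float)
# Joint data matrix: one row per utterance, acoustic and semantic features side by side.
V = np.hstack([V_ac, V_sem])

# Factorise V ~ H @ W: rows of W couple recurring acoustic patterns to
# semantic slot values; H holds the per-utterance pattern activations.
model = NMF(n_components=n_patterns, init="nndsvda",
            solver="mu", beta_loss="kullback-leibler", max_iter=500)
H = model.fit_transform(V)   # (n_utt, n_patterns) activations
W = model.components_        # (n_patterns, n_acoustic + n_semantic) dictionary

# At test time only the acoustic features of a new utterance are observed;
# activations estimated against W[:, :n_acoustic] can then be mapped to
# semantics through W[:, n_acoustic:], which is the basis of decoding.
W_ac, W_sem = W[:, :n_acoustic], W[:, n_acoustic:]
```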

Highlights

  • Modern voice control technology is available in many applications such as direct voice input (DVI) in aviation [1], information requests using Siri, and speech-driven home automation

  • The goal of the experiments is twofold: first, we test the feasibility of our vocal user interface (VUI) by evaluating the performance of the framework using the F-score on slot-value recognition as defined in [18] (a minimal sketch of this metric follows the list); second, we investigate the added value of a more layered semantic frame structure on two datasets: PATCOR, containing commands with a complex grammar, and DOMOTICA-3, containing realistic recordings of commands from speech-impaired speakers in the setting of a virtual home automation system

  • When we compare Gaussian mixture model (GMM)-based learning with non-negative matrix factorisation (NMF)-based learning in the upper panel, we observe steeper learning curves for the NMF-based learning for the group of severely dysarthric speakers, yielding an average improvement of 23% (t(159) = 30.2, p < 0.001)
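
As a companion to the F-score evaluation mentioned in the experiments, the snippet below shows one plausible way to compute a micro-averaged F-score over decoded (slot, value) pairs. The frame layout and slot names are hypothetical and chosen for illustration; they do not reproduce the exact definition in [18].

```python
from typing import Dict, List

def slot_value_f_score(references: List[Dict[str, str]],
                       predictions: List[Dict[str, str]]) -> float:
    """Micro-averaged F-score over (slot, value) pairs per utterance."""
    tp = fp = fn = 0
    for ref, hyp in zip(references, predictions):
        ref_pairs, hyp_pairs = set(ref.items()), set(hyp.items())
        tp += len(ref_pairs & hyp_pairs)   # correctly decoded slot values
        fp += len(hyp_pairs - ref_pairs)   # spurious slot values
        fn += len(ref_pairs - hyp_pairs)   # missed slot values
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Toy usage with hypothetical home-automation frames:
refs = [{"action": "open", "object": "blinds"},
        {"action": "switch_on", "object": "light", "room": "kitchen"}]
hyps = [{"action": "open", "object": "blinds"},
        {"action": "switch_on", "object": "radio", "room": "kitchen"}]
print(slot_value_f_score(refs, hyps))  # 0.8 for this toy example
```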


Introduction

Modern voice control technology is available in many applications such as direct voice input (DVI) in aviation [1], information requests using Siri, and speech-driven home automation. Command-and-control (C&C) appliances afford hands-free control, enhancing the independence of the physically incapacitated. Speech commands are sometimes misinterpreted when spoken words fall outside the lexicon or word sequences do not fit the preset grammars. C&C appliances frequently fail to interpret dialectal or impaired speech, often encountered with physically challenged people. As a result, people with non-standard speech are increasingly excluded from the growing market of voice-driven applications. The goal of this work is to investigate a vocal user interface (VUI) model which is able to learn words and grammars from end users, improving the accessibility of C&C applications.
