Abstract

This paper introduces a new methodology for in-the-wild multimodal corpus creation for audio-visual speech recognition in driver monitoring systems, designed with driver comfort in mind. The methodology is universal and can be used to record corpora for different languages. We present an analysis of speech recognition systems and voice interfaces for driver monitoring systems based on both audio and video data. Multimodal speech recognition makes it possible to rely on audio data when video data are useless (e.g., at nighttime) and to apply video data in acoustically noisy conditions (e.g., on highways). Our methodology identifies the main steps and requirements for multimodal corpus design, including the development of a new framework for audio-visual corpus creation. We identify the main research questions related to the speech corpus creation task and discuss them in detail in this paper. We also consider the main use cases that require speech recognition in a vehicle cabin for interaction with a driver monitoring system, including an important case in which the system detects a dangerous state of driver drowsiness and starts a question-answer game to prevent hazardous situations. Finally, based on the proposed methodology, we developed a mobile application that allowed us to record a corpus for the Russian language. Using this application, we created the RUSAVIC corpus, which is at the moment a unique audio-visual corpus for the Russian language recorded in in-the-wild conditions.

Highlights

  • In recent years, modern smartphones have become promising, multifunctional, and powerful devices intended not only for calls and text messages but also for a variety of tasks, including informational, multimedia, productivity, safety, lifestyle, and accessibility-related applications, among many others

  • We consider modern research in the following main topics related to the problem domain: driver monitoring systems based on smartphone sensors, speech interfaces in the vehicle cabin, and available audio-visual speech recognition corpora recorded in the vehicle cabin

  • RA1: According to the conducted analysis, the state-of-the-art methodology for tackling the problem of audio-visual speech recognition in a vehicle environment is usually based either on discrete cosine transform (DCT) coefficient-based or on active appearance model (AAM)-based feature extraction, followed by hidden Markov models (HMM) used for classification
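The DCT-based visual front end mentioned in the highlight above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a grayscale mouth region of interest (ROI) has already been cropped from a video frame, computes an orthonormal 2D DCT-II, and keeps the low-frequency top-left block of coefficients as the per-frame visual feature vector that would feed an HMM. The function names and the 8×8 block size are illustrative choices, not from the paper.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis matrix (n x n).
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0] *= 1 / np.sqrt(2)
    return m * np.sqrt(2 / n)

def dct2_features(roi, keep=8):
    # 2D DCT of a grayscale mouth ROI; retain the top-left
    # keep x keep low-frequency block as the visual feature vector.
    h, w = roi.shape
    coeffs = dct_matrix(h) @ roi @ dct_matrix(w).T
    return coeffs[:keep, :keep].ravel()

# Example on a hypothetical 32x32 mouth region
roi = np.random.default_rng(0).random((32, 32))
feat = dct2_features(roi, keep=8)
print(feat.shape)  # (64,)
```

Keeping only the low-frequency coefficients compresses the ROI while preserving the coarse mouth shape, which is why DCT features were a common choice for lip reading before deep feature extractors.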

Summary

INTRODUCTION

Modern smartphones have become promising, multifunctional, and powerful devices intended not only for calls and text messages but also for a variety of tasks, including informational, multimedia, productivity, safety, lifestyle, and accessibility-related applications, among many others. We carried out a related-work analysis on driver monitoring systems based on smartphone sensors to identify the main scenarios the designed speech recognition system should support. We review speech recognition approaches based on audio and video data, which allows us to identify the metaparameters our corpus creation should support. We define a vocabulary both for the voice commands the driver uses to interact with the system and for the dialog-based question/answer games the system proposes to the driver in order to detect a dangerous drowsiness state. The conclusion summarizes the paper and contains the main discussion of the results.

RELATED WORK
MULTIMODAL CORPORA FOR AUDIO-VISUAL SPEECH RECOGNITION IN VEHICLE CABIN
DISCUSSION
Findings
VIII. CONCLUSION