Audio-visual Corpus Research Articles

The analysis of lectures and meetings inside smart rooms has recently attracted much interest in the literature, being the focus of international projects and technology evaluations. A key enabler for progress in this area is the availability of appropriate multimodal and multi-sensory corpora, annotated with rich human activity information during lectures and meetings. This paper is devoted to exactly such a corpus, developed in the framework of the European project CHIL, “Computers in the Human Interaction Loop”. The resulting data set has the potential to drastically advance the state-of-the-art, by providing numerous synchronized audio and video streams of real lectures and meetings, captured in multiple recording sites over the past 4 years. It particularly overcomes typical shortcomings of other existing databases that may contain limited sensory or monomodal data, exhibit constrained human behavior and interaction patterns, or lack data variability. The CHIL corpus is accompanied by rich manual annotations of both its audio and visual modalities. These provide a detailed multi-channel verbatim orthographic transcription that includes speaker turns and identities, acoustic condition information, and named entities, as well as video labels in multiple camera views that provide multi-person 3D head and 2D facial feature location information. Over the past 3 years, the corpus has been crucial to the evaluation of a multitude of audiovisual perception technologies for human activity analysis in lecture and meeting scenarios, demonstrating its utility during internal evaluations of the CHIL consortium, as well as at the recent international CLEAR and Rich Transcription evaluations. The CHIL corpus is publicly available to the research community.

The aim of this paper is to study how contrastive focus is conveyed by prosody both articulatorily and acoustically and how viewers extract focus structure from visual prosodic realizations. Is the visual modality useful for the perception of prosody? An audiovisual corpus was recorded from a male native speaker of French. The sentences had a subject–verb–object (SVO) structure. Four contrastive focus conditions were studied: focus on each phrase (S, V or O) and broad focus. Normal and reiterant modes were recorded, only the latter was studied. An acoustic validation (fundamental frequency, duration and intensity) showed that the speaker had pronounced the utterances with a typical focused intonation on the focused phrase. Then, lip height and jaw opening were extracted from the video data. An articulatory analysis suggested a set of possible visual cues to focus for reiterant /ma/ speech: (a) prefocal lengthening, (b) large jaw opening and high opening velocities on all the focused syllables; (c) long lip closure for the first focused syllable and (d) hypo-articulation (reduced jaw opening and duration) of the following phrases. A visual perception test was developed. It showed that (a) contrastive focus was well perceived visually for reiterant speech; (b) no training was necessary and (c) subject focus was slightly easier to identify than the other focus conditions. We also found that if the visual cues identified in our articulatory analysis were present and marked, perception was enhanced. This enables us to assume that the visual cues extracted from the corpus are probably the ones which are indeed perceptively salient.

Audio-visual Corpus Research Articles

Related Topics

Articles published on Audio-visual Corpus

On the Soft Fusion of Probability Mass Functions for Multimodal Speech Processing

A 3-D Audio-Visual Corpus of Affective Communication

A study of lip movements during spontaneous dialog and its application to voice activity detection

An ultrasound-based silent speech interface

The CHIL audiovisual corpus for lecture and meeting analysis inside smart rooms

An audio-visual corpus for speech perception and automatic speech recognition

Construction of Audio-Visual Speech Corpus Using Motion-Capture System and Corpus Based Facial Animation

Visual perception of contrastive focus in reiterant French speech

‘allora’ e ‘entonces’: problemi teorici e dati empirici

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Audio-visual Corpus Research Articles

Related Topics

Articles published on Audio-visual Corpus

On the Soft Fusion of Probability Mass Functions for Multimodal Speech Processing

A 3-D Audio-Visual Corpus of Affective Communication

A study of lip movements during spontaneous dialog and its application to voice activity detection

An ultrasound-based silent speech interface

The CHIL audiovisual corpus for lecture and meeting analysis inside smart rooms

An audio-visual corpus for speech perception and automatic speech recognition

Construction of Audio-Visual Speech Corpus Using Motion-Capture System and Corpus Based Facial Animation

Visual perception of contrastive focus in reiterant French speech

‘allora’ e ‘entonces’: problemi teorici e dati empirici