Abstract

This paper presents a novel approach to audio-visual consistency judgment (AVCJ), in which vowel-like regions exhibiting significant changes in lip shape are used as key pronunciations. Unlike conventional methods, which typically apply speech information to the whole utterance and its corresponding visual features, the proposed technique performs key pronunciation analysis and frontal lip reconstruction for the non-frontal lip frames corresponding to these key pronunciations. This eliminates the influence of variations in video viewing angle on the consistency judgment and addresses the limitations of whole-sentence analysis. The method consists of four steps. First, key pronunciations are detected in the speech utterance. Second, view-angle classification is performed on the lip frames corresponding to the key pronunciations, and frontal reconstruction is then carried out by the proposed SL-CycleGAN. Third, deep correlation analysis is applied to the covariance of the deep features obtained from the audio and visual streams. Fourth, time-delay and correlation differences are combined to estimate a joint score for each key pronunciation. The AVCJ result is then obtained by computing a sentence-level score from these joint scores. Experimental results on the OuluVS2 bimodal multi-view database show that the proposed method outperforms several state-of-the-art algorithms, including quadratic mutual information (QMI), space-time canonical correlation analysis, and multiple variants of SyncNet.
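The per-pronunciation scoring and sentence-level aggregation described in steps three and four can be sketched as follows. This is a minimal illustration, not the paper's actual formulation: it uses classical canonical correlation (via a whitened cross-covariance matrix) in place of the paper's deep correlation analysis, and `best_lag`, the weighting constant `alpha`, and the mean-based sentence score are hypothetical simplifications introduced here.

```python
import numpy as np

def first_canonical_correlation(A, V, reg=1e-6):
    """Top canonical correlation between audio features A (T, da) and
    visual features V (T, dv), computed from the whitened cross-covariance."""
    A = A - A.mean(axis=0)
    V = V - V.mean(axis=0)
    Caa = A.T @ A / len(A) + reg * np.eye(A.shape[1])  # regularized auto-covariances
    Cvv = V.T @ V / len(V) + reg * np.eye(V.shape[1])
    Cav = A.T @ V / len(A)                             # cross-covariance
    Wa = np.linalg.inv(np.linalg.cholesky(Caa))        # whitening transform for A
    Wv = np.linalg.inv(np.linalg.cholesky(Cvv))        # whitening transform for V
    # singular values of the whitened cross-covariance are the canonical correlations
    return np.linalg.svd(Wa @ Cav @ Wv.T, compute_uv=False)[0]

def best_lag(a, v, max_lag=5):
    """Time delay (in frames) maximizing the correlation between two
    1-D per-frame summaries a and v."""
    def corr_at(lag):
        x, y = (a[lag:], v[:len(v) - lag]) if lag >= 0 else (a[:lag], v[-lag:])
        return np.corrcoef(x, y)[0, 1]
    return max(range(-max_lag, max_lag + 1), key=corr_at)

def joint_score(A, V, alpha=0.1, max_lag=5):
    """Combine correlation strength and a time-delay penalty for one key
    pronunciation; alpha is an illustrative weighting constant."""
    rho = first_canonical_correlation(A, V)
    lag = best_lag(A.mean(axis=1), V.mean(axis=1), max_lag)
    return rho - alpha * abs(lag)

def sentence_score(segments):
    """Aggregate joint scores over all key pronunciations of an utterance."""
    return float(np.mean([joint_score(A, V) for A, V in segments]))
```

A high sentence-level score then indicates that audio and visual streams are consistent; in practice a threshold tuned on labeled consistent/inconsistent pairs would be applied.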
