Abstract

In this paper, we exploit the non-linear relation between a speech source and its associated lip video as a source of extra information to propose an improved audio-visual speech source separation (AVSS) algorithm. The audio-visual association is modeled by a neural associator that estimates the visual lip parameters from a temporal context of acoustic observation frames. We define an objective function based on the mean square error (MSE) between the estimated and target visual parameters; this function is minimized to estimate the de-mixing vector/filters that separate the relevant source from linear instantaneous or time-domain convolutive mixtures. We also propose a hybrid criterion that combines AV coherency with kurtosis as a non-Gaussianity measure. Experimental results are presented and compared in terms of visually relevant speech detection accuracy and the output signal-to-interference ratio (SIR) of source separation. The suggested audio-visual model significantly improves relevant speech classification accuracy compared to an existing GMM-based model, and the proposed AVSS algorithm improves speech separation quality compared to reference ICA- and AVSS-based methods.
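As a rough illustration of the criterion described above, the following Python sketch evaluates the AV-MSE cost for a candidate de-mixing vector in the instantaneous-mixture case. The names `associator` (a trained MLP mapping a temporal context of acoustic frames to lip parameters) and `acoustic_features` (a placeholder feature extractor) are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def av_mse_cost(b, X, V, associator, acoustic_features):
    """AV coherency cost: MSE between MLP-estimated and observed lip parameters.

    b  : (n_mics,) candidate de-mixing vector (instantaneous mixtures)
    X  : (n_mics, n_samples) observed mixture signals
    V  : (n_frames, n_lip_params) target visual lip parameters
    associator, acoustic_features : assumed trained MLP and feature
        extractor; both are placeholders for illustration.
    """
    y = b @ X                      # candidate separated source
    A = acoustic_features(y)       # temporal context of acoustic frames
    V_hat = associator(A)          # MLP estimate of the lip parameters
    return np.mean((V_hat - V) ** 2)
```

Minimizing this cost over b (e.g. by gradient descent under a norm constraint, to avoid the trivial zero solution) selects the output most coherent with the video; for convolutive mixtures, b generalizes to de-mixing filters.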

Highlights

  • Audio-visual speech source separation (AVSS) is a growing field of research that has developed in recent years

  • In contrast to a sequential and loose combination of independent component analysis (ICA) and the AV coherency model, we also propose a parallel and tight combination via a hybrid criterion that benefits from normalized kurtosis as a statistical independence measure in conjunction with the AV coherency measure

  • We propose a hybrid criterion based on a combination of the AV criterion (8) and the normalized kurtosis: $J_{\text{av-ICA}}(\mathbf{B}; \mathbf{x}_\tau, \mathbf{V}_\tau) = J_{\text{av-MLP}}(\mathbf{B}; \mathbf{x}_\tau, \mathbf{V}_\tau) - \lambda\,\mathrm{kurt}_n(\mathbf{B}\mathbf{x}_\tau)$ (14)
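As a hedged sketch of Eq. (14), the hybrid cost below combines an AV coherency term with the normalized kurtosis. Here `j_av_mlp` stands in for the MLP-based AV cost (e.g. the `av_mse_cost` sketch above), and the value of the weight `lam` is an arbitrary placeholder, not taken from the paper.

```python
import numpy as np

def normalized_kurtosis(y):
    """kurt_n(y) = E[y^4] / (E[y^2])^2 - 3; zero for a Gaussian signal,
    so larger values indicate stronger non-Gaussianity."""
    y = y - np.mean(y)
    return np.mean(y ** 4) / np.mean(y ** 2) ** 2 - 3.0

def hybrid_cost(b, X, V, j_av_mlp, lam=0.1):
    """Eq. (14): J_av-ICA = J_av-MLP - lambda * kurt_n(b @ X).

    Minimization favors outputs that are both coherent with the video
    (small AV cost) and non-Gaussian (large normalized kurtosis).
    """
    return j_av_mlp(b, X, V) - lam * normalized_kurtosis(b @ X)
```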


Summary

Introduction

Audio-visual speech source separation (AVSS) is a growing field of research that has developed in recent years. Rivet et al. [16] adopted the AV coherency of speech (measured by a trained log-Rayleigh distribution) to resolve the permutation indeterminacy in the frequency-domain separation of convolutive mixtures. They proposed another method [11] for convolutive AVSS based on developing a visual voice activity detector (VVAD) and using it in a geometric separation algorithm under a sparse-source assumption. Most AVSS algorithms work by maximizing the AV coherency between the unmixed signals y and their corresponding video streams. It is shown in [12] that, given a coarse spectral envelope of the sources, one can solve a system of equations to calculate the de-mixing matrix for regular mixtures. Clearly, when multiple video streams corresponding to more than one speech source are available, all the described methods can be repeated for each video stream, as sketched below.
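To make the last point concrete, here is a minimal sketch, assuming a hypothetical `optimize_demixing` helper that minimizes one of the AV costs above, of how the extraction is repeated once per available video stream:

```python
def separate_by_video(X, video_streams, optimize_demixing):
    """Extract one source per available lip-parameter stream.

    X                 : (n_mics, n_samples) mixture signals
    video_streams     : list of per-speaker lip-parameter sequences
    optimize_demixing : hypothetical helper minimizing an AV cost
        (e.g. av_mse_cost or hybrid_cost) over the de-mixing vector.
    """
    sources = []
    for V in video_streams:          # one video stream per target speaker
        b = optimize_demixing(X, V)  # de-mixing vector for this speaker
        sources.append(b @ X)        # separated signal y = b x
    return sources
```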

Audio-visual speech source separation using MLP AV modeling
Toward a time domain AVSS for convolutive mixtures
Separation performance criterion
Findings
Conclusion