Abstract

In this paper, an audio-visual model for separating the speech of a target speaker from a mixture of other speakers' speech is proposed. It can be used in speech separation, automatic speech recognition (ASR) systems, and in creating single-speaker speech databases. Speech separation is a difficult problem when only audio information is used, so visual and auditory signals are combined to perform the separation. The proposed model consists of four modules: two for the audio signal, one for the visual features, and a final one that concatenates the features produced by the previous three modules to generate the separated signals. The proposed model improved Short-Time Objective Intelligibility (STOI) by 11%, Perceptual Evaluation of Speech Quality (PESQ) by 24%, and Frequency-weighted Segmental SNR (fwSNRseg) by 16% compared with previous works. It also improved Csig, the predicted rating of speech distortion, by 13% and Covl, the predicted rating of overall quality, by 18% compared with previous audio-visual models.
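To make the four-module structure concrete, the following is a minimal PyTorch sketch of one plausible realization: two audio streams, one visual stream, and a fusion module that concatenates their features to predict the separated target speech. All module internals, layer sizes, input representations (magnitude and phase spectrograms, per-frame visual embeddings), and names are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of a four-module audio-visual separator.
# Assumed inputs: two audio representations (e.g., magnitude and phase
# spectrograms) and per-frame visual embeddings aligned with the audio frames.
import torch
import torch.nn as nn


class AudioVisualSeparator(nn.Module):
    def __init__(self, n_freq=257, visual_dim=512, hidden=256):
        super().__init__()
        # Module 1: audio stream over the mixture magnitude spectrogram.
        self.audio_stream_a = nn.LSTM(n_freq, hidden, batch_first=True)
        # Module 2: second audio stream (assumed here to process a
        # complementary representation, e.g., the phase spectrogram).
        self.audio_stream_b = nn.LSTM(n_freq, hidden, batch_first=True)
        # Module 3: visual stream over per-frame face/lip embeddings.
        self.visual_stream = nn.LSTM(visual_dim, hidden, batch_first=True)
        # Module 4: fusion -- concatenate the three feature streams and
        # predict a time-frequency mask for the target speaker.
        self.fusion = nn.Sequential(
            nn.Linear(3 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_freq),
            nn.Sigmoid(),  # mask values in [0, 1]
        )

    def forward(self, mag, phase, visual):
        # mag, phase: (batch, time, n_freq); visual: (batch, time, visual_dim)
        a, _ = self.audio_stream_a(mag)
        b, _ = self.audio_stream_b(phase)
        v, _ = self.visual_stream(visual)
        fused = torch.cat([a, b, v], dim=-1)
        mask = self.fusion(fused)
        # Estimated target-speaker magnitude spectrogram.
        return mask * mag


if __name__ == "__main__":
    model = AudioVisualSeparator()
    mag = torch.randn(2, 100, 257)
    phase = torch.randn(2, 100, 257)
    visual = torch.randn(2, 100, 512)
    print(model(mag, phase, visual).shape)  # torch.Size([2, 100, 257])
```

The masking formulation shown here is only one common choice for the output stage of such fusion models; the concatenation of the three feature streams before the final module is the part that mirrors the architecture described in the abstract.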
