Abstract

In this paper, an audio-visual model for separating the speech of a target speaker from a mixture of other speakers' speech is proposed. It can be used in speech separation, automatic speech recognition (ASR) systems, and in creating single-speaker speech databases. Speech separation is a difficult problem when only audio information is used, so visual and auditory signals are combined to perform the separation. The proposed model consists of four modules: two for the audio signal, one for the visual features, and a final one that concatenates the features produced by the previous three modules to generate the separated signals. The proposed model improved Short-Time Objective Intelligibility (STOI) by 11%, Perceptual Evaluation of Speech Quality (PESQ) by 24%, and Frequency-weighted Segmental SNR (fwSNRseg) by 16% compared with previous works. It also improved Csig, the predicted rating of speech distortion, by 13% and Covl, the predicted rating of overall quality, by 18% compared with previous audio-visual models.
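To make the four-module structure concrete, the following is a minimal PyTorch sketch of one plausible realization: two audio streams, one visual stream, and a fusion module that concatenates their features to predict the separated target speech. All module internals, layer sizes, input representations (magnitude and phase spectrograms, per-frame visual embeddings), and names are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of a four-module audio-visual separator.
# Assumed inputs: two audio representations (e.g., magnitude and phase
# spectrograms) and per-frame visual embeddings aligned with the audio frames.
import torch
import torch.nn as nn


class AudioVisualSeparator(nn.Module):
    def __init__(self, n_freq=257, visual_dim=512, hidden=256):
        super().__init__()
        # Module 1: audio stream over the mixture magnitude spectrogram.
        self.audio_stream_a = nn.LSTM(n_freq, hidden, batch_first=True)
        # Module 2: second audio stream (assumed here to process a
        # complementary representation, e.g., the phase spectrogram).
        self.audio_stream_b = nn.LSTM(n_freq, hidden, batch_first=True)
        # Module 3: visual stream over per-frame face/lip embeddings.
        self.visual_stream = nn.LSTM(visual_dim, hidden, batch_first=True)
        # Module 4: fusion -- concatenate the three feature streams and
        # predict a time-frequency mask for the target speaker.
        self.fusion = nn.Sequential(
            nn.Linear(3 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_freq),
            nn.Sigmoid(),  # mask values in [0, 1]
        )

    def forward(self, mag, phase, visual):
        # mag, phase: (batch, time, n_freq); visual: (batch, time, visual_dim)
        a, _ = self.audio_stream_a(mag)
        b, _ = self.audio_stream_b(phase)
        v, _ = self.visual_stream(visual)
        fused = torch.cat([a, b, v], dim=-1)
        mask = self.fusion(fused)
        # Estimated target-speaker magnitude spectrogram.
        return mask * mag


if __name__ == "__main__":
    model = AudioVisualSeparator()
    mag = torch.randn(2, 100, 257)
    phase = torch.randn(2, 100, 257)
    visual = torch.randn(2, 100, 512)
    print(model(mag, phase, visual).shape)  # torch.Size([2, 100, 257])
```

The masking formulation shown here is only one common choice for the output stage of such fusion models; the concatenation of the three feature streams before the final module is the part that mirrors the architecture described in the abstract.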
