Abstract

A technique for the early fusion of visual lip movements and a vector of mixed speech signals is proposed. The technique first reconstructs a speech signal for each speaker entirely from that speaker's visual lip motions. Using geometric lip parameters obtained from the Tulips1 database and the Audio–Visual Speech Processing dataset, a virtual speech signal is reconstructed with audiovisual training segments serving as the basis. It is shown that the visually reconstructed speech signal has an envelope directly related to that of the original acoustic signal. This reconstructed visual envelope is then used to aid the robust separation of the mixed speech signals by identifying the vocally active and silent periods of each speaker. Unlike previous signal separation techniques, which required an ideal mixture of independent signals, the proposed technique estimates the mixing coefficients very accurately even in non-ideal situations: in the presence of speech noise, the mixing coefficients are correctly estimated at signal-to-noise ratios (SNRs) as low as 0 dB, and in the presence of Gaussian noise, at SNRs as low as 10 dB.
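To make the separation step concrete, the following is a minimal sketch of the core idea for a two-speaker instantaneous mixture. It is not the paper's implementation: the helper names (`gather_frames`, `estimate_mixing_ratios`), the frame length, and the envelope thresholds are all illustrative assumptions. The sketch shows how per-frame envelopes reconstructed from lip motion can mark frames where only one speaker is vocally active, and how the mixing-coefficient ratios then follow from a least-squares fit between the two mixtures over those frames.

```python
import numpy as np

FRAME_LEN = 512  # samples per analysis frame (illustrative choice)

def gather_frames(x, mask, frame_len=FRAME_LEN):
    """Concatenate the samples of x that fall in the frames selected by mask."""
    idx = np.flatnonzero(mask)
    return np.concatenate([x[i * frame_len:(i + 1) * frame_len] for i in idx])

def estimate_mixing_ratios(x1, x2, env1, env2, active=0.5, silent=0.05):
    """
    Estimate mixing-coefficient ratios for mixtures x_i = a_i1*s1 + a_i2*s2.

    env1, env2 : per-frame envelopes reconstructed from each speaker's lip
                 motion, on the same frame grid implied by FRAME_LEN.
    Returns (a21/a11, a22/a12). During frames where only speaker k is
    vocally active, both mixtures are scaled copies of s_k, so a
    least-squares fit of x2 on x1 over those samples recovers the ratio.
    Assumes each speaker has at least one solo (single-speaker) frame;
    the threshold values here are hypothetical.
    """
    n = min(len(x1) // FRAME_LEN, len(env1), len(env2))
    e1 = env1[:n] / env1[:n].max()  # normalise visual envelopes to [0, 1]
    e2 = env2[:n] / env2[:n].max()

    ratios = []
    for solo, other in ((e1, e2), (e2, e1)):
        # frames where one speaker is vocally active and the other silent
        mask = (solo > active) & (other < silent)
        seg1, seg2 = gather_frames(x1, mask), gather_frames(x2, mask)
        ratios.append(float(seg2 @ seg1 / (seg1 @ seg1)))
    return tuple(ratios)
```

With both ratios known, the mixing matrix is determined up to a per-source scaling and the sources can be recovered by inverting it; in practice the envelope thresholds would be tuned to the quality of the visual reconstruction.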
