Abstract

A technique is proposed for virtually recreating a speech signal entirely from the visual lip motions of a speaker. Using six geometric lip parameters obtained from the Tulips1 database, a virtual speech signal is recreated from a 3.6 s audiovisual training segment. It is shown that the envelope of the virtual speech signal is directly related to the envelope of the original acoustic signal. This visually reconstructed envelope is then used as the basis for robust speech separation when the visual parameters of all the speakers are available. It is shown that, unlike previous signal separation techniques, which require an ideal mixture of independent signals, the proposed technique estimates the mixture coefficients very accurately even in non-ideal situations.
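The abstract does not detail how visually reconstructed envelopes yield the mixture coefficients, so the following is only an illustrative sketch of the general idea, not the paper's method. It assumes (hypothetically) that each source behaves like envelope-modulated noise, so the frame-wise power of a one-microphone mixture is a linear combination of the frame-wise squared source envelopes; the coefficients can then be recovered by least squares. All signal parameters below (sampling rate, envelopes, mixing gains) are made up for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 8000
t = np.arange(fs * 2) / fs  # 2 s of signal

# Stand-ins for envelopes recreated from visual lip parameters:
# slowly varying, distinct modulations per speaker (illustrative only).
e1 = 0.5 + 0.5 * np.abs(np.sin(2 * np.pi * 1.3 * t))
e2 = 0.5 + 0.5 * np.abs(np.sin(2 * np.pi * 0.7 * t + 1.0))

# Assumed source model: envelope times independent noise carrier.
s1 = e1 * rng.standard_normal(t.size)
s2 = e2 * rng.standard_normal(t.size)

a1_true, a2_true = 0.8, 0.3          # unknown mixing gains to recover
x = a1_true * s1 + a2_true * s2      # single observed mixture

# Frame-wise power of the mixture and of each (visual) envelope.
frame = 400  # 50 ms frames at 8 kHz
n = t.size // frame
P = (x[: n * frame].reshape(n, frame) ** 2).mean(axis=1)
E1 = (e1[: n * frame].reshape(n, frame) ** 2).mean(axis=1)
E2 = (e2[: n * frame].reshape(n, frame) ** 2).mean(axis=1)

# For independent sources, E[P] = a1^2 * E1 + a2^2 * E2 per frame,
# so the squared gains follow from a least-squares fit.
A = np.column_stack([E1, E2])
coef, *_ = np.linalg.lstsq(A, P, rcond=None)
a1_est, a2_est = np.sqrt(np.clip(coef, 0.0, None))
print(a1_est, a2_est)  # close to the true gains 0.8 and 0.3
```

Because the envelopes of the two speakers evolve differently over time, the frame-power regression is well conditioned even for a single mixture channel, which is the intuition behind using visual information where purely statistical separation methods would need an ideal mixture.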
