Abstract

We propose a method for extracting only a speaker's voice from the various sounds present in a real environment. The method fuses audio and visual information. With a single modality, audio or visual alone, it is difficult to isolate a specific speaker's voice in a real environment. In the field of human interfaces, techniques for acquiring a specific voice in real environments need to be developed for hands-free systems. The proposed method combines sound source separation with image processing that estimates the speaker's location using a color transformation and a proposed filter. We use time-frequency masking for blind source separation, and image processing to extract the speaker's lip region. The separated sounds and their source directions are obtained with a microphone array, and the speaker's direction is obtained with a video camera. We take the sound direction that contains the lip region to be the speaker's direction. By fusing the information obtained from these methods, we can extract only the speaker's voice in a real environment. The proposed method is applied in a real environment, and the experimental results are shown.
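
The fusion step described above, choosing the separated sound whose direction contains the lip region, might be sketched as follows. This is only an illustrative sketch, not the authors' implementation: it assumes a pinhole camera model, assumes the camera and microphone array are co-located and share the same zero azimuth, and uses hypothetical names (`pixel_to_azimuth`, `select_speaker_source`, `tolerance_deg`) that do not appear in the paper.

```python
import math


def pixel_to_azimuth(x_pixel, image_width, horizontal_fov_deg):
    """Map a horizontal pixel coordinate to an azimuth in degrees.

    Assumption: ideal pinhole camera whose optical axis is azimuth 0.
    """
    # Focal length expressed in pixels, derived from the horizontal field of view.
    focal_px = (image_width / 2.0) / math.tan(math.radians(horizontal_fov_deg) / 2.0)
    return math.degrees(math.atan((x_pixel - image_width / 2.0) / focal_px))


def select_speaker_source(source_azimuths_deg, lip_x_left, lip_x_right,
                          image_width, horizontal_fov_deg, tolerance_deg=5.0):
    """Return the index of the separated source whose direction falls inside
    the lip region (widened by a hypothetical tolerance), or None if no
    source direction matches.
    """
    left = pixel_to_azimuth(lip_x_left, image_width, horizontal_fov_deg)
    right = pixel_to_azimuth(lip_x_right, image_width, horizontal_fov_deg)
    center = (left + right) / 2.0

    # Keep only the sources whose estimated direction overlaps the lip region.
    candidates = [
        i for i, az in enumerate(source_azimuths_deg)
        if left - tolerance_deg <= az <= right + tolerance_deg
    ]
    if not candidates:
        return None
    # If several sources match, prefer the one closest to the lip center.
    return min(candidates, key=lambda i: abs(source_azimuths_deg[i] - center))


if __name__ == "__main__":
    # Three separated sources at assumed azimuths; lip region near the image center.
    index = select_speaker_source([-40.0, 1.0, 25.0], 300, 340, 640, 60.0)
    print("selected source index:", index)
```

In this sketch, the separated signal at the returned index would be kept as the speaker's voice and the others discarded. The tolerance is a design choice to absorb direction-estimation error and would need tuning for a real setup.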
