Abstract
This paper proposes a method for matching an unfamiliar person's face to an unfamiliar voice. The idea is motivated by human crossmodal perception, which gives rise to illusions such as the McGurk effect and the ventriloquist illusion. In particular, we focus on recent psychological evidence suggesting that humans can match unfamiliar faces to unfamiliar voices to some extent, and the aim of this paper is to reproduce this ability on a computer. To realize the matching, a dataset of paired facial images and corresponding voices is used as prior knowledge: the unfamiliar voice is matched to the closest known speaker model, and since the database contains the corresponding facial image, the system can estimate the closest known face from the unfamiliar voice. Finally, each unfamiliar face is matched against the estimated known face to obtain the recognition result. To this end, we first implement a speaker recognition system based on Mel-frequency cepstral coefficients (MFCCs) as the speech feature and Gaussian mixture models (GMMs) as the classifier. We also use a two-dimensional HMM-based face recognizer and propose a statistical integration of the audio and visual recognition results. To demonstrate the feasibility of the proposed system, unfamiliar speaker recognition experiments are carried out using 60 sentences from the ATR-503 sentence set uttered by 20 university students.
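As a rough illustration of the MFCC-and-GMM speaker-modelling step summarized above, the following sketch enrolls one GMM per known speaker and assigns an unfamiliar voice to the closest known model. The library choices (librosa, scikit-learn), function names, and file paths are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch, assuming librosa and scikit-learn; not the paper's code.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(wav_path, n_mfcc=13):
    """Extract an (n_frames, n_mfcc) matrix of MFCCs from a waveform."""
    signal, sr = librosa.load(wav_path, sr=None)
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T

def enroll_speakers(training_data, n_components=16):
    """Fit one diagonal-covariance GMM per known speaker.

    training_data: dict mapping speaker_id -> list of wav paths.
    Returns a dict mapping speaker_id -> fitted GaussianMixture.
    """
    models = {}
    for speaker_id, wav_paths in training_data.items():
        frames = np.vstack([mfcc_features(p) for p in wav_paths])
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", max_iter=200)
        models[speaker_id] = gmm.fit(frames)
    return models

def closest_known_speaker(models, wav_path):
    """Return the known speaker whose GMM best explains the unfamiliar voice."""
    frames = mfcc_features(wav_path)
    # score() gives the average per-frame log-likelihood under each model.
    return max(models, key=lambda spk: models[spk].score(frames))

# Hypothetical usage: the speaker-to-face mapping stands in for the paired
# face/voice database described in the abstract.
# models = enroll_speakers({"spk01": ["spk01_s1.wav"], "spk02": ["spk02_s1.wav"]})
# best = closest_known_speaker(models, "unknown.wav")
# estimated_face = known_faces[best]  # facial image paired with that speaker
```

In the full system described by the abstract, the face estimated this way would then be compared against the unfamiliar face with the 2D HMM-based face recognizer, and the two scores combined statistically.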