Abstract

Many systems using microphone arrays have been studied for localizing sound sources. Conventional techniques, such as MUSIC and CSP (e.g., Johnson & Dudgeon, 1996; Omologo & Svaizer, 1996; Asano et al., 2000; Denda et al., 2006), use simultaneous phase information from microphone arrays to estimate the direction of the arriving signal. There have also been studies on binaural source localization based on interaural differences, such as the interaural level difference and the interaural time difference (e.g., Keyrouz et al., 2006; Takimoto et al., 2006). However, microphone-array-based systems may not be suitable in some cases because of their size and cost. Single-channel techniques are therefore of interest, especially in scenarios involving small devices. Source separation with a single microphone is one of the most challenging problems in signal processing, and several techniques have been described (e.g., Kristiansson et al., 2004; Raj et al., 2006; Jang et al., 2003; Nakatani & Juang, 2006). In our previous work (Takiguchi et al., 2001; Takiguchi & Nishimura, 2004), we proposed HMM (Hidden Markov Model) separation for reverberant speech recognition, in which the observed (reverberant) speech is separated into the acoustic transfer function and the clean-speech HMM. With HMM separation, the acoustic transfer function can be estimated from a small amount of adaptation data (only a few words) uttered from a given position, so measurement of impulse responses is not required. Because the characteristics of the acoustic transfer function depend on the position, the estimated acoustic transfer function can be used to localize the talker. In this paper, we discuss a new talker-localization method that uses only a single microphone. In our previous work on reverberant speech recognition (Takiguchi et al., 2001), HMM separation required the texts of a user's utterances in order to estimate the acoustic transfer function. However, such texts are difficult to obtain in talker-localization tasks. In this paper, the acoustic transfer function is therefore estimated from the observed (reverberant) speech using only a clean-speech model, without relying on the texts of the user's utterances, where a GMM (Gaussian Mixture Model) is used to model clean-speech features. The estimation is performed in the cepstral domain using a maximum-likelihood approach; this is possible because cepstral parameters are an effective representation for retaining useful clean-speech information. The results of our talker-localization experiments show the effectiveness of the proposed method.
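To make the cepstral-domain maximum-likelihood step concrete, the sketch below shows one common way such an estimate can be computed; it is a minimal illustration, not the authors' implementation. It assumes that convolution with the room response appears approximately additive in the cepstral domain (observed cepstrum = clean cepstrum + a frame-invariant transfer-function cepstrum H), that a diagonal-covariance clean-speech GMM (weights w, means mu, variances var) is available, and that numpy/scipy are installed; the function name and array shapes are hypothetical.

import numpy as np
from scipy.special import logsumexp

def estimate_transfer_function(O, w, mu, var, n_iter=10):
    """EM-style ML estimate of the transfer-function cepstrum H.

    O   : observed (reverberant) cepstra, shape (T, D)
    w   : GMM mixture weights, shape (K,)
    mu  : GMM means, shape (K, D)
    var : GMM diagonal variances, shape (K, D)
    """
    H = np.zeros(O.shape[1])
    for _ in range(n_iter):
        # E-step: mixture posteriors for the compensated cepstra O - H
        # under the clean-speech GMM.
        X = O - H                                              # (T, D)
        log_p = (np.log(w)
                 - 0.5 * np.sum(np.log(2.0 * np.pi * var), axis=1)
                 - 0.5 * np.sum((X[:, None, :] - mu) ** 2 / var, axis=2))
        gamma = np.exp(log_p - logsumexp(log_p, axis=1, keepdims=True))
        # M-step: closed-form update of H, a variance-weighted average of
        # (O_t - mu_k) over all frames t and mixtures k.
        wts = gamma[:, :, None] / var[None, :, :]              # (T, K, D)
        H = (np.sum(wts * (O[:, None, :] - mu), axis=(0, 1))
             / np.sum(wts, axis=(0, 1)))
    return H

Given such an estimate, localization reduces to discriminating the estimated transfer function: for example, comparing it against models of the transfer functions previously estimated for each candidate position and choosing the position whose model assigns it the highest likelihood.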
