Abstract
The difference in propagation delay of a speech signal travelling from the source to each microphone, known as the time difference of arrival (TDOA), carries information about the position of the speech source. TDOA estimation plays a vital role in diverse systems such as teleconferencing and far-field speech recognition, since the TDOA is a key parameter affecting the quality of the restored speech signals. This paper is devoted to estimating the TDOA of a single speech source on a frame-by-frame basis in noisy, anechoic environments. First, we propose two variants of the Gaussian mixture model to represent the speech signal received by a microphone pair, assuming Gaussianity of the signal and modeling speech sparsity through the speech presence probability (SPP). Second, after estimating the noise parameter in advance and expressing the speech parameters through the maximum likelihood principle, the proposed Gaussian mixture models are reduced to depend on only two unknowns, i.e., the TDOA and the SPP. Third, based on these two models, we present two distinct estimators that estimate the TDOA and the SPP iteratively using the expectation-maximization algorithm. The two proposed estimators are free from the ad hoc parameter selection required by many classical approaches. Simulation results show that the TDOA they estimate can be more accurate than that of state-of-the-art generalized cross-correlation (GCC) variants over a wide range of frames with specific SPP values. More importantly, the automatically estimated SPP, which can serve as a soft voice activity detector, encodes information about the accuracy of the TDOA estimate: in a speech frame, a large estimated SPP indicates a small TDOA estimation error. For example, when the SPP is larger than 0.76 and 0.87 for the two proposed estimators, respectively, the TDOA estimation error could be at most 19% of the worst-case error.
Highlights
Speech source localization, i.e., determining the spatial position of a speech source, is a fundamental issue in ad hoc acoustic sensor networks composed of distributed microphones [1]–[3], and it is attracting growing interest in many applications such as teleconferencing [4], far-field speech recognition [5], and surveillance [6].
Many source localization approaches have been developed in recent decades. They can be classified into two groups, i.e., spatial spectrum and time-frequency (TF) processing; see Table 1.
It results in four time difference of arrival (TDOA) values, that is, 7.0691 × 10⁻⁵ s, −1.4126 × 10⁻⁴ s, 0 s, and 6.7436 × 10⁻⁴ s.
Summary
Speech source localization, i.e., determining the spatial position of a speech source, is a fundamental issue in ad hoc acoustic sensor networks composed of distributed microphones [1]–[3], and it is attracting growing interest in many applications such as teleconferencing [4], far-field speech recognition [5], and surveillance [6]. The mainstream localization applications focus on either time difference of arrival (TDOA) estimation [7], [8] or direction of arrival (DOA) estimation [9]. Many source localization approaches have been developed in recent decades. They can be classified into two groups, i.e., spatial spectrum and time-frequency (TF) processing; see Table 1. The spatial spectrum approaches construct a spectrum function of the spatial parameters (i.e., TDOA or DOA). The locations of the highest peaks of the spectrum function indicate the TDOA (or DOA) candidates. The spectrum function can be constructed by methods such as the generalized cross-correlation (GCC) algorithm [10] and subspace-based methods. The GCC function is expressed by inserting a weight
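The spatial spectrum idea described above, where the location of the highest peak of the spectrum function gives the TDOA candidate, can be illustrated with a minimal GCC sketch for one microphone pair. This is not the estimator proposed in the paper. The function name `gcc_phat_tdoa`, its parameters, and the choice of the PHAT weight are assumptions made for illustration only; the text here is cut off before it states which weight is inserted.

```python
import numpy as np


def gcc_phat_tdoa(x1, x2, fs, max_tau=None):
    """Estimate the delay of x2 relative to x1 in seconds.

    Illustrative sketch only: the PHAT weight is an assumed choice,
    not taken from the source text. A positive result means x2 lags x1.
    """
    # Zero-pad so the circular correlation behaves like a linear one.
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)

    # Cross-power spectrum, normalized to unit magnitude (PHAT weight).
    cross = X2 * np.conj(X1)
    cross = cross / (np.abs(cross) + 1e-12)

    # Back to the lag domain: this is the GCC function.
    cc = np.fft.irfft(cross, n=n)

    # Restrict the peak search to physically plausible lags if requested.
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)

    # Reorder so that index 0 corresponds to lag -max_shift.
    cc = np.roll(cc, max_shift)[: 2 * max_shift + 1]

    # The highest peak of the GCC function is the TDOA candidate.
    lag = int(np.argmax(cc)) - max_shift
    return lag / fs
```

In practice `max_tau` would typically be set from the microphone spacing divided by the speed of sound, so that the search ignores lags that cannot occur physically. The resolution of this sketch is limited to one sample period, `1 / fs`.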