Abstract

In this study, the authors discuss the unsupervised separation of two speakers from a single-microphone recording using empirical mode decomposition (EMD) and the Hilbert transform (HT), together known as the Hilbert-Huang transform. A two-stage separation procedure is proposed for single-channel (SC) speech separation. The first stage uses EMD, the HT and instantaneous frequencies. EMD decomposes the mixed signal into oscillatory functions known as intrinsic mode functions (IMFs). Suitable IMFs are selected using successive EMD, and the HT is applied to extract their instantaneous frequencies. The speech frames are grouped into two speakers using the correlation between the instantaneous frequencies of the mixed signal and those of the selected IMFs. The second stage further decomposes the estimated speaker signals into IMFs and finds their instantaneous amplitudes using the HT. For each speaker, the ratio of the instantaneous amplitudes of the mixed speech to those of the stage 1 recovered speech signal is computed. The histogram of this ratio can be used to estimate the ideal binary mask for each speaker. These masks are applied to the speech mixture to estimate the underlying speakers. The proposed method was compared with existing unsupervised SC source separation algorithms. The results show significant improvement in objective measures.
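
Both stages rely on quantities read off the analytic signal that the HT produces: the instantaneous amplitude is its magnitude, and the instantaneous frequency is the rate of change of its phase. The following is a minimal sketch of that step only, not the authors' implementation. It uses a plain DFT from the Python standard library, and the sampling rate, frame length and test tone are hypothetical values chosen for illustration; the function names are likewise illustrative.

```python
import cmath
import math


def analytic_signal(x):
    """Return the analytic signal of a real sequence x (FFT-style Hilbert method)."""
    n = len(x)
    # Forward DFT, O(n^2): acceptable for one short illustrative frame.
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Keep DC (and Nyquist when n is even), double the positive
    # frequencies, and zero the negative ones.
    weights = [0.0] * n
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
    for k in range(1, (n + 1) // 2):
        weights[k] = 2.0
    # Inverse DFT of the one-sided spectrum.
    return [
        sum(weights[k] * spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
            for k in range(n)) / n
        for t in range(n)
    ]


def instantaneous_amplitude(z):
    """Envelope: magnitude of the analytic signal at each sample."""
    return [abs(v) for v in z]


def instantaneous_frequency(z, fs):
    """Frequency in Hz from the phase advance between consecutive samples."""
    return [
        cmath.phase(z[t + 1] * z[t].conjugate()) * fs / (2 * math.pi)
        for t in range(len(z) - 1)
    ]


# Hypothetical test frame: a 50 Hz tone of amplitude 0.5 sampled at 1 kHz,
# standing in for a single IMF (assumed values, not from the paper).
fs = 1000.0
n = 200
tone = [0.5 * math.cos(2 * math.pi * 50.0 * t / fs) for t in range(n)]

z = analytic_signal(tone)
amp = instantaneous_amplitude(z)
freq = instantaneous_frequency(z, fs)
```

For this tone, which completes a whole number of cycles in the frame, the recovered envelope stays at 0.5 and the instantaneous frequency at 50 Hz. Real IMFs are neither stationary nor periodic within a frame, so the estimates fluctuate and are less reliable near the frame edges.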
