Abstract

This paper examines the role of the fundamental frequency f0 and the formants F1, F2 and F3 of the speech signal in supervised and unsupervised source separation of real recorded convolutive speech mixtures. Supervised source separation, in which the sources are assumed to be known a priori, is considered first, using (1) only the fundamental frequency f0, (2) only the formants F1, F2 and F3, and (3) both f0 and the formants F1, F2 and F3. The third case, which combines f0 with the formants, gives the most accurate separation results and is therefore used as the ideal case, or reference, against which the unsupervised separation results are compared. Unsupervised source separation, in which nothing is known about the sources a priori, is then considered using (1) the cross-correlation of the formants of different frames together with f0 and (2) the standard deviation of the magnitude of the frequency components in the F1, F2 and F3 regions of the spectrogram. The separation results obtained with both unsupervised methods are very close to those of the ideal supervised case. The results show that this approach works better than some classical blind source separation algorithms, such as independent component analysis and non-negative matrix factorization, which work well only for instantaneous mixtures, where delay is neglected.
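As a rough illustration of the second unsupervised feature named above (the standard deviation of spectral magnitude in the F1, F2 and F3 regions), the following minimal Python sketch computes one such value per frame and per region. It is not the paper's implementation: the band edges in `FORMANT_BANDS`, the frame length, the hop size, the Hann window and the function names are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

# Hypothetical nominal formant regions in Hz (assumed; the abstract gives no band edges).
FORMANT_BANDS = {"F1": (300.0, 1000.0), "F2": (1000.0, 2500.0), "F3": (2500.0, 3500.0)}


def magnitude_spectrogram(x, frame_len=512, hop=256):
    """Return the magnitude spectrogram, shape (n_frames, frame_len // 2 + 1)."""
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hanning(frame_len)  # assumed analysis window
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def band_std_features(x, fs, frame_len=512, hop=256, bands=FORMANT_BANDS):
    """Per-frame standard deviation of the magnitude inside each formant region.

    Returns an array of shape (n_frames, len(bands)), columns in the order of `bands`.
    """
    mag = magnitude_spectrogram(x, frame_len, hop)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    columns = []
    for low, high in bands.values():
        mask = (freqs >= low) & (freqs < high)  # frequency bins inside this region
        columns.append(mag[:, mask].std(axis=1))
    return np.stack(columns, axis=1)
```

Calling `band_std_features(x, fs)` on a mono signal `x` sampled at `fs` Hz yields one row per frame. How these values would then be used to assign frames or frequency components to individual sources is not described in the abstract and is left out of the sketch.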
