Abstract

This work explores the utility of the Hilbert Spectrum (HS) of the speech signal, constructed from its AM-FM components, or Intrinsic Mode Functions (IMFs), in characterizing speakers for the task of Text-Dependent Speaker Verification (TDSV). The IMFs of the speech signal are obtained using Modified Empirical Mode Decomposition (MEMD), a data analysis technique suited to non-linear and non-stationary signals. The HS, which represents the instantaneous frequencies and instantaneous energies of the IMFs, is processed in short time segments to generate features, which are then evaluated on the TDSV task. Two databases, the RSR2015 and the IITG, are used to validate the experimental findings. The performance of the TDSV system is evaluated for the individual features and for their combinations with the 39-dimensional Mel Frequency Cepstral Coefficients (MFCCs). To assess the practical utility of the features, they are tested not only on clean speech, but also on speech corrupted by low-frequency (Babble) noise and by environmental noise. The experiments reveal that the features obtained from the HS, in combination with the MFCCs, enhance the performance of the TDSV system. Further, the extracted features are effective at very low dimensions. Moreover, under noisy conditions, the features extracted from the HS are found to be consistently more effective than the cepstral/energy features obtained from the raw IMFs.
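As a rough illustration of the quantities the HS is built from, the sketch below estimates the instantaneous frequency and instantaneous energy of a single AM-FM component through its analytic signal. It is only a minimal sketch under stated assumptions: it does not implement MEMD or the feature extraction used in this work, the function name `instantaneous_freq_energy` is hypothetical, and a synthetic amplitude-modulated tone stands in for an IMF.

```python
import numpy as np
from scipy.signal import hilbert


def instantaneous_freq_energy(component, fs):
    """Estimate instantaneous frequency (Hz) and energy of one AM-FM component.

    Hypothetical helper for illustration; not the method of the paper.
    """
    analytic = hilbert(component)           # analytic signal: x(t) + j * H{x(t)}
    amplitude = np.abs(analytic)            # instantaneous amplitude envelope
    phase = np.unwrap(np.angle(analytic))   # continuous instantaneous phase
    # Discrete phase derivative, converted from radians/sample to Hz.
    inst_freq = np.diff(phase) * fs / (2.0 * np.pi)
    # Squared envelope, trimmed to match the length of inst_freq.
    inst_energy = amplitude[:-1] ** 2
    return inst_freq, inst_energy


# Assumed stand-in for an IMF: a 500 Hz carrier with slow 20 Hz amplitude modulation.
fs = 8000
t = np.arange(800) / fs
component = (1.0 + 0.3 * np.cos(2 * np.pi * 20 * t)) * np.cos(2 * np.pi * 500 * t)

inst_freq, inst_energy = instantaneous_freq_energy(component, fs)
```

Applying such a step to every IMF and accumulating the energies on a time-frequency grid would give one possible discrete Hilbert Spectrum; how the spectrum is then segmented into short-time features is specific to the paper and is not reproduced here.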
