Abstract

Traditional Mel Frequency Cepstral Coefficient (MFCC) features reflect only low-frequency information and ignore the high-frequency part of the signal. To address this, the paper applies a Mel-frequency wavelet packet decomposition to the speech signal and extracts valuable features from the high-frequency components, including short-time energy, fundamental frequency, formants, MFCC, and sub-band energy. These features are important parameters of speech signals. The short-time energy is computed with the energy formula, the pitch period and fundamental frequency are obtained from the autocorrelation function, and the formant frequencies are characterized with the Hilbert-Huang transform. Speech is continuous in the time domain, yet features extracted frame by frame reflect only the emotional content of a single frame. So that the features better reflect temporal continuity, a Long Short-Term Memory (LSTM) network is used to capture information shared between adjacent frames. LSTM is well suited to speech recognition because it can exploit dynamically changing temporal information.
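As a rough illustration of two of the frame-level features named in the abstract, the sketch below computes short-time energy as the sum of squared samples per frame and estimates the fundamental frequency from the peak of the frame's autocorrelation. It is not the paper's implementation: the function names, the 40 ms frame length, the 20 ms hop, the 60 to 400 Hz pitch search range, and the synthetic 200 Hz test tone are all assumptions chosen for the example.

```python
import numpy as np


def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def short_time_energy(frames):
    """Short-time energy: sum of squared samples in each frame."""
    return np.sum(frames ** 2, axis=1)


def pitch_autocorr(frame, sr, fmin=60.0, fmax=400.0):
    """Estimate F0 in Hz from the autocorrelation peak.

    The search is limited to lags between sr/fmax and sr/fmin
    (an assumed plausible pitch range, not taken from the paper).
    """
    frame = frame - np.mean(frame)
    # Keep only non-negative lags of the full autocorrelation.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)
    lag_max = min(int(sr / fmin), len(ac) - 1)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return sr / lag  # pitch period is lag / sr seconds


# Synthetic stand-in for a voiced speech segment: a 200 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 200.0 * t)

frames = frame_signal(x, frame_len=640, hop=320)  # 40 ms frames, 20 ms hop
energy = short_time_energy(frames)
f0 = pitch_autocorr(frames[0], sr)
```

On real speech the autocorrelation peak is less clean than on a pure tone, so a voiced/unvoiced decision and some smoothing across frames would normally be added. The resulting per-frame feature vectors are the kind of sequence that an LSTM would then consume.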

