Abstract

Pitch is an important characteristic of speech and is useful for many applications. However, estimating pitch in strong noise remains challenging. In this paper, we propose a joint training approach to pitch estimation. First, a bidirectional long short-term memory recurrent neural network (BLSTM-RNN) is trained to map noisy speech features to clean speech features. Second, pitch is estimated by another BLSTM-RNN. The feature mapping network serves as a noise normalization module: it explicitly generates clean features, from which the following network can estimate pitch more easily. The BLSTM-RNN is trained on sequential frame-level features and is capable of learning temporal dynamics. We also propose to take bottleneck features into account for pitch estimation. The experimental results show that the proposed method estimates pitch accurately and generalizes well to new speakers and noise conditions. The proposed approach also significantly outperforms other state-of-the-art pitch estimation algorithms.
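
The two-stage structure described above can be illustrated with a minimal forward-pass sketch. It is not the authors' implementation: the layer sizes, the random untrained weights, the single scalar pitch output per frame, and all names (`LSTMDirection`, `BLSTM`, `feature_mapper`, `pitch_estimator`, `estimate_pitch`) are assumptions made for illustration. Joint training and the bottleneck features are not modeled.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_matrix(rows, cols, rng):
    # Small random weights; a real system would learn these by training.
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

class LSTMDirection:
    """One direction of an LSTM layer (hypothetical, untrained)."""

    def __init__(self, n_in, n_hid, rng):
        self.n_hid = n_hid
        # One weight matrix per gate: input, forget, candidate, output.
        self.w = {k: make_matrix(n_hid, n_in + n_hid, rng) for k in "ifgo"}

    def run(self, frames):
        h = [0.0] * self.n_hid
        c = [0.0] * self.n_hid
        outputs = []
        for x in frames:
            z = list(x) + h  # current frame concatenated with previous state
            i_gate = [sigmoid(v) for v in matvec(self.w["i"], z)]
            f_gate = [sigmoid(v) for v in matvec(self.w["f"], z)]
            cand = [math.tanh(v) for v in matvec(self.w["g"], z)]
            o_gate = [sigmoid(v) for v in matvec(self.w["o"], z)]
            c = [f * ck + i * g for f, ck, i, g in zip(f_gate, c, i_gate, cand)]
            h = [o * math.tanh(ck) for o, ck in zip(o_gate, c)]
            outputs.append(h)
        return outputs

class BLSTM:
    """Bidirectional LSTM followed by a linear projection per frame."""

    def __init__(self, n_in, n_hid, n_out, rng):
        self.fwd = LSTMDirection(n_in, n_hid, rng)
        self.bwd = LSTMDirection(n_in, n_hid, rng)
        self.proj = make_matrix(n_out, 2 * n_hid, rng)

    def __call__(self, frames):
        h_fwd = self.fwd.run(frames)
        # Run the backward direction on the reversed sequence, then realign.
        h_bwd = self.bwd.run(frames[::-1])[::-1]
        return [matvec(self.proj, a + b) for a, b in zip(h_fwd, h_bwd)]

rng = random.Random(0)
N_FEAT, N_HID = 8, 6  # assumed sizes, not taken from the paper

feature_mapper = BLSTM(N_FEAT, N_HID, N_FEAT, rng)  # noisy -> clean features
pitch_estimator = BLSTM(N_FEAT, N_HID, 1, rng)      # clean features -> pitch

def estimate_pitch(noisy_frames):
    clean_estimate = feature_mapper(noisy_frames)
    # Assumption: one scalar pitch value per frame.
    pitch_track = [out[0] for out in pitch_estimator(clean_estimate)]
    return clean_estimate, pitch_track

noisy = [[rng.gauss(0.0, 1.0) for _ in range(N_FEAT)] for _ in range(5)]
clean_hat, pitch = estimate_pitch(noisy)
print(len(clean_hat), len(clean_hat[0]), len(pitch))
```

Because each network reads the sequence in both directions, the pitch value at any frame depends on the whole utterance, which is one way to read the abstract's remark about learning temporal dynamics. In a jointly trained system the two networks would be optimized together rather than used with fixed random weights as here.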
