Abstract

This paper introduces a novel method to classify voiced frames of noisy speech signals as low-frequency or high-frequency. This separation improves the accuracy of fundamental frequency (F0) estimators. In this proposal, the target signal is analyzed by means of ensemble empirical mode decomposition. Next, pitch information is extracted from the first decomposition modes. This feature indicates the frequency region where the speech F0 should be located, and thus separates the frames into low-frequency and high-frequency classes. The frame separation is then applied to correct the pitch candidates produced by an F0 detection method, improving the estimation accuracy. The proposed method and a baseline separation approach are evaluated with four different F0 estimation algorithms. Experiments are conducted with the CSTR and TIMIT databases and six noise types at various signal-to-noise ratios. The Gross Error (GE) and Mean Absolute Error (MAE) metrics are adopted to evaluate the solutions in terms of F0 estimation errors. Results show that the proposed method outperforms the baseline in terms of low/high frequency separation accuracy. Moreover, the proposed solution yields larger improvements in F0 detection accuracy under different noisy conditions.
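
For readers unfamiliar with the two evaluation metrics, the following is a minimal sketch of how GE and MAE are commonly computed over voiced frames. The abstract does not give the exact definitions used in the paper, so the details here are assumptions: the 20% relative tolerance for a gross error, the convention that a reference F0 of 0 marks an unvoiced frame, and the function names `gross_error_rate` and `mean_absolute_error` are all illustrative.

```python
def _voiced_pairs(f0_ref, f0_est):
    """Keep only frames whose reference F0 is positive (assumed voiced)."""
    pairs = [(r, e) for r, e in zip(f0_ref, f0_est) if r > 0]
    if not pairs:
        raise ValueError("no voiced frames in the reference track")
    return pairs


def gross_error_rate(f0_ref, f0_est, tol=0.2):
    """Fraction of voiced frames whose estimate deviates from the
    reference by more than tol * reference (tol=0.2 is an assumption)."""
    pairs = _voiced_pairs(f0_ref, f0_est)
    gross = sum(1 for r, e in pairs if abs(e - r) > tol * r)
    return gross / len(pairs)


def mean_absolute_error(f0_ref, f0_est):
    """Mean absolute F0 deviation in Hz over voiced frames."""
    pairs = _voiced_pairs(f0_ref, f0_est)
    return sum(abs(e - r) for r, e in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical F0 tracks in Hz; 0 marks an unvoiced frame.
    ref = [100.0, 200.0, 0.0, 150.0]
    est = [105.0, 100.0, 0.0, 150.0]
    print(gross_error_rate(ref, est))
    print(mean_absolute_error(ref, est))
```

In this toy example the second frame is an octave error (100 Hz estimated for a 200 Hz reference), which is the kind of mistake a low/high frequency frame label could help correct.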
