Abstract
This paper presents a pitch-detection algorithm (PDA) for signals containing continuous speech. The core of the method is merged normalized forward-backward correlation (MNFBC), which works in the time domain and can make basic voicing decisions. In addition, a Viterbi traceback procedure post-processes the MNFBC output, considering the three best fundamental frequency (F0) candidates in each step. This should make the final pitch contour smoother and should also prevent octave errors. In computing the transition probabilities between F0 candidates, we made two major improvements over existing post-processing methods. Firstly, we compare pitch distance in musical cent units. Secondly, temporal forgetting is applied in order to avoid penalizing pitch jumps after prosodic pauses of one speaker or changes in pitch connected with turn-taking in dialogs. Results computed on a pitch-reference database clearly show the benefit of the first improvement, but they have not yet shown any benefit from the temporal modification. We assume this is due to the nature of the reference corpus, which had a small amount of suprasegmental content.
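The cent-based comparison mentioned in the abstract can be illustrated with a minimal sketch. It relies only on the standard definition of the musical cent (1200 cents per octave). The function names `cents` and `hz_distance` are illustrative assumptions, not names from the paper, and the sketch does not reproduce the paper's actual transition-probability formula.

```python
import math


def cents(f_a: float, f_b: float) -> float:
    """Signed interval from f_a to f_b in musical cents (1200 cents = one octave)."""
    if f_a <= 0 or f_b <= 0:
        raise ValueError("frequencies must be positive")
    return 1200.0 * math.log2(f_b / f_a)


def hz_distance(f_a: float, f_b: float) -> float:
    """Direct (linear) frequency difference in Hz, shown for comparison."""
    return abs(f_b - f_a)


# Two octave jumps, one in a low register and one in a high register.
for f_a, f_b in [(100.0, 200.0), (200.0, 400.0)]:
    print(f"{f_a:.0f} -> {f_b:.0f} Hz: "
          f"{hz_distance(f_a, f_b):.0f} Hz, {cents(f_a, f_b):.0f} cents")
```

Both jumps span one octave and therefore the same number of cents, while their linear differences are 100 Hz and 200 Hz. A distance measured in cents thus treats the same musical interval equally for low-pitched and high-pitched voices.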
Highlights
Almost every audible sound tends to have a fundamental frequency
MNFBCv1 is the basic variant with the voiced/unvoiced (V/UV) decision threshold set to 0.5 and with the transition probability of the Viterbi procedure computed from the direct frequency difference
To compare our method with other widely used methods, we added the results for autocorrelation in the frequency domain (ACF freq, a very good method for tracking singing) and the Direct Frequency Estimation method (DFE) [8], which is currently used for evaluating Parkinson’s disease at FEE CTU in Prague
Summary
Almost every audible sound tends to have a fundamental frequency. This is the lowest frequency on which the signal is periodic, and we sense this frequency as the height (pitch) of the sound. Human speech perception is partly based on intonation (changes of pitch), which is an aspect of prosody. Thanks to this we can distinguish whether a person is making a statement or asking a question [1]. One motivation for finding a precise and robust PDA is to track the intonation contour in continuous speech. This is a crucial step for the proper functioning of, e.g., a punctuation detector [2] or an emotion classifier of the speaker. Several pitch detection methods are known today. They can generally be divided according to the domain in which they operate (time, frequency, cepstrum, etc.). An overview of some basic methods can be found in [12]. AMDF [5] (time domain), the cepstral method [4] (modification of the spectrum domain) and sub-harmonic summation (SHS) [3] are well described and widely used methods.
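As a rough illustration of a time-domain method such as AMDF, the following minimal sketch estimates F0 of a synthetic sine tone by finding the lag that minimizes the average magnitude difference. It is not the MNFBC method of this paper. The function names, the sampling rate, the frame length and the F0 search range are all assumptions chosen for the example.

```python
import math


def amdf(frame, lag):
    """Average magnitude difference between the frame and its lagged copy."""
    n = len(frame) - lag
    return sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n


def estimate_f0(frame, fs, f_min, f_max):
    """Return fs / lag for the lag in [fs/f_max, fs/f_min] with the smallest AMDF."""
    lag_min = int(fs / f_max)
    lag_max = int(fs / f_min)
    best_lag = min(range(lag_min, lag_max + 1), key=lambda lag: amdf(frame, lag))
    return fs / best_lag


# Synthetic test signal (assumed values): a 200 Hz sine sampled at 8 kHz.
fs = 8000
f0_true = 200.0
frame = [math.sin(2 * math.pi * f0_true * n / fs) for n in range(400)]

# The search range is deliberately narrower than one octave below f0_true:
# a periodic signal also matches itself at multiples of its period, which is
# the ambiguity behind octave errors.
f0_est = estimate_f0(frame, fs, f_min=125.0, f_max=400.0)
print(f0_est)  # 200.0
```

With a wider search range, lags of 80 or 120 samples would fit this signal as well as the true period of 40 samples, so a practical detector needs some additional rule to choose among such candidates.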