Abstract
Background and Objectives: In dysprosodic speech, prosody does not match the expected intonation pattern and can result in robotic-sounding speech in which each syllable is produced with equal stress. These errors manifest as inconsistent lexical stress, as measured by perceptual judgments and/or acoustic variables. Lexical stress is produced through variations in syllable duration, peak intensity, and fundamental frequency. The presented technique automatically evaluates the unequal lexical stress patterns Strong-Weak (SW) and Weak-Strong (WS) in American English continuous speech using a multi-layer feed-forward neural network with seven acoustic features chosen to capture the lexical stress variability between two consecutive syllables.

Methods: The speech corpus used in this work is the PTDB-TUG. Five female and three male speakers were chosen to form the training set, and one female and one male speaker the test set. The CMU Pronouncing Dictionary, with lexical stress levels marked, was used to assign a stress level to each syllable of every word in the corpus. Lexical stress is phonetically realized through the manipulation of signal intensity, the fundamental frequency (F0) and its dynamics, and syllable/vowel duration. For each syllable in the collected speech, the nucleus duration, syllable duration, mean pitch, maximum pitch over the nucleus, peak-to-peak amplitude integral over the syllable nucleus, mean energy, and maximum energy over the nucleus were calculated. Because lexical stress errors are identified by evaluating the variability between consecutive syllables in a word, we computed the pairwise variability index (PVI) for each acoustic measure. The PVI for an acoustic feature f_i is given by

PVI_i = (f_i1 - f_i2) / ((f_i1 + f_i2) / 2),  (1)

where f_i1 and f_i2 are the values of the acoustic feature for the first and second syllables, respectively.
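As a minimal illustration of equation (1) — the function name is hypothetical, not from the paper — the PVI for one pair of consecutive-syllable feature values can be computed as:

```python
def pvi(f1, f2):
    """Pairwise variability index of one acoustic feature, eq. (1):
    the difference between the values for the first and second
    syllables, normalized by their mean."""
    return (f1 - f2) / ((f1 + f2) / 2.0)

# Example: a first syllable three times longer than the second
# (e.g. nucleus durations 0.3 s vs 0.1 s) yields a PVI of 1.0.
print(pvi(0.3, 0.1))
```

A positive PVI indicates the first syllable dominates (consistent with an SW pattern for that feature), a negative PVI the second, and 0 equal values; applying this to each of the seven acoustic measures yields the seven-dimensional input vector described above.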
A multi-layer feed-forward neural network consisting of input, hidden, and output layers was used to classify the stress patterns of the words in the database. Results: The presented system achieved an overall accuracy of 87.6%. It correctly classified 92.4% of the SW stress patterns and 76.5% of the WS stress patterns. Conclusions: A feed-forward neural network classified SW and WS stress patterns in American English continuous speech with an overall accuracy of 87.6%.
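The classifier architecture can be sketched as follows. The abstract specifies only seven inputs, one hidden layer, and a two-way SW/WS output; the hidden-layer size (10), tanh activation, softmax output, and random weights below are illustrative assumptions, not the paper's trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: 7 PVI inputs, 10 hidden units (hypothetical), 2 outputs (SW, WS).
W1 = rng.normal(size=(7, 10))
b1 = np.zeros(10)
W2 = rng.normal(size=(10, 2))
b2 = np.zeros(2)

def forward(x):
    """One forward pass of the feed-forward network: input -> hidden -> output."""
    h = np.tanh(x @ W1 + b1)            # hidden-layer activation (assumed tanh)
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()                  # probabilities over (SW, WS)

x = rng.normal(size=7)                  # one vector of seven PVI features
p = forward(x)                          # p[0] = P(SW), p[1] = P(WS)
```

In practice the weights would be fit on the labeled training-set syllable pairs (e.g. by backpropagation) rather than drawn at random as here.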