Abstract

Classical models of speech recognition (both human and machine) assume that a detailed, short-term analysis of the signal is essential for accurate decoding of spoken language via a linear sequence of phonetic segments. This classical framework is incommensurate with quantitative acoustic/phonetic analyses of spontaneous discourse (e.g., the Switchboard corpus for American English). Such analyses indicate that the syllable, rather than the phone, is likely to serve as the representational interface between sound and meaning, providing a relatively stable representation of lexically relevant information across a wide range of speaking and acoustic conditions. The auditory basis of this syllabic representation appears to be derived from the low-frequency (2–16 Hz) modulation spectrum, whose temporal properties correspond closely to the distribution of syllabic durations observed in spontaneous speech. Perceptual experiments confirm the importance of the modulation spectrum for understanding spoken language and demonstrate that the intelligibility of speech is derived from both the amplitude and phase components of this spectral representation. Syllable-based automatic speech recognition systems, currently under development, have proven useful under various acoustic conditions representative of the real world (such as reverberation and background noise) when used in conjunction with more traditional, phone-based recognition systems.
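To make the central quantity concrete: the low-frequency modulation spectrum can be estimated from the slow fluctuations of a signal's temporal envelope. The sketch below (Python with NumPy/SciPy) is a minimal illustration, not the authors' procedure; the Hilbert-based envelope extraction, the fourth-order low-pass filter, and the 2–16 Hz band limits are illustrative assumptions chosen to match the range discussed above. It returns both the amplitude and the phase of the envelope spectrum, the two components the perceptual experiments identify as relevant to intelligibility.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def modulation_spectrum(signal, fs, env_cutoff=32.0, band=(2.0, 16.0)):
    """Estimate the low-frequency modulation spectrum of a speech signal.

    Returns modulation frequencies (Hz) together with the amplitude and
    phase of the envelope's Fourier transform within the given band.
    """
    # 1. Extract the temporal envelope via the Hilbert transform.
    envelope = np.abs(hilbert(signal))

    # 2. Low-pass the envelope to isolate slow fluctuations
    #    (cutoff is an illustrative choice, not from the paper).
    b, a = butter(4, env_cutoff / (fs / 2), btype="low")
    envelope = filtfilt(b, a, envelope)

    # 3. Remove the mean so the spectrum reflects modulation alone.
    envelope -= envelope.mean()

    # 4. Fourier-transform the windowed envelope; amplitude and phase
    #    together constitute the modulation-spectral representation.
    spectrum = np.fft.rfft(envelope * np.hanning(len(envelope)))
    freqs = np.fft.rfftfreq(len(envelope), d=1.0 / fs)

    # Keep only the syllable-rate band (roughly 2-16 Hz).
    keep = (freqs >= band[0]) & (freqs <= band[1])
    return freqs[keep], np.abs(spectrum[keep]), np.angle(spectrum[keep])
```

The peak of the amplitude component for conversational speech typically falls in the 3–5 Hz region, consistent with the distribution of syllabic durations noted above.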
