Abstract
Speech segmentation is a crucial step in automatic speech recognition because additional speech analyses are performed for each framed speech segment. Conventional segmentation techniques primarily segment speech using a fixed frame size for computational simplicity. However, this approach is insufficient for capturing the quasi-regular structure of speech, which causes substantial recognition failure in noisy environments. How does the brain handle quasi-regular structured speech and maintain high recognition performance under any circumstance? Recent neurophysiological studies have suggested that the phase of neuronal oscillations in the auditory cortex contributes to accurate speech recognition by guiding speech segmentation into smaller units at different timescales. A phase-locked relationship between neuronal oscillation and the speech envelope has recently been obtained, which suggests that the speech envelope provides a foundation for multi-timescale speech segmental information. In this study, we quantitatively investigated the role of the speech envelope as a potential temporal reference to segment speech using its instantaneous phase information. We evaluated the proposed approach by the achieved information gain and recognition performance in various noisy environments. The results indicate that the proposed segmentation scheme not only extracts more information from speech but also provides greater robustness in a recognition test.
Highlights
Segmenting continuous speech into short frames is the first step in the feature extraction process of an automatic speech recognition (ASR) system
Following these observations in the brain, the nested oscillatory reference effect in the auditory system is modeled in three steps: (i) extract primary and secondary frequency band oscillations from the speech envelope as speech segmental references; (ii) partition the primary and secondary frequency band oscillations, using their phase quadrant boundaries as the frame start and end points; and (iii) couple the primary and secondary frequency band oscillations such that the property of the primary frequency band oscillation shapes the appearance of the secondary frequency band oscillation
Extraction of the secondary frequency band oscillation from the speech envelope is performed in the first frame region, where its energy falls within the threshold range
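Steps (i) and (ii) can be illustrated with a minimal sketch. It assumes the speech envelope has already been computed and sampled at `fs` Hz, and it uses a simple FFT mask as the band-pass filter, which is an illustrative choice rather than the filter used in the study. The function names, the toy 5 Hz envelope, and the 4 to 10 Hz band edges in the usage lines are all hypothetical. Step (iii), the coupling between primary and secondary oscillations, and the energy threshold are not specified in enough detail here to sketch, so they are omitted.

```python
import numpy as np

def band_oscillation(envelope, fs, low_hz, high_hz):
    """Return the complex analytic signal of `envelope` restricted to [low_hz, high_hz]."""
    envelope = np.asarray(envelope, dtype=float)
    spectrum = np.fft.fft(envelope)
    freqs = np.fft.fftfreq(envelope.size, d=1.0 / fs)
    # Keep only positive frequencies inside the band; doubling them
    # yields the analytic signal of the band-limited oscillation.
    mask = (freqs >= low_hz) & (freqs <= high_hz)
    return np.fft.ifft(2.0 * spectrum * mask)

def quadrant_boundaries(analytic):
    """Sample indices at which the instantaneous phase enters a new quadrant."""
    phase = np.angle(analytic)  # radians in [-pi, pi]
    # Quadrant index 0..3; the modulo folds phase == +pi onto the same
    # quadrant as phase == -pi so the wrap does not add a spurious boundary.
    quadrant = np.mod(np.floor(phase / (np.pi / 2)), 4).astype(int)
    return np.flatnonzero(np.diff(quadrant) != 0) + 1

def frames_from_boundaries(boundaries):
    """Pair consecutive boundaries into (start, end) frames, end exclusive."""
    return list(zip(boundaries[:-1], boundaries[1:]))

# Toy envelope: a 5 Hz modulation on a constant offset, 1 s at 1 kHz.
fs = 1000
t = np.arange(fs) / fs
toy_envelope = 1.0 + 0.5 * np.cos(2 * np.pi * 5 * t + 0.3)

theta = band_oscillation(toy_envelope, fs, 4.0, 10.0)
bounds = quadrant_boundaries(theta)
frames = frames_from_boundaries(bounds)
```

For this toy input each frame spans one quarter of the 200 ms modulation cycle, so the frame length follows the oscillation period instead of being fixed in advance. With a real envelope the period drifts over time, and the frame length would drift with it.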
Summary
Segmenting continuous speech into short frames is the first step in the feature extraction process of an automatic speech recognition (ASR) system. For the periodic parts of speech, such as a vowel, the conventional frame size and shift rate cause unnecessary overlap, leading to the addition of redundant information and insertion errors in noisy environments[5]. To overcome these problems, various speech segmentation techniques have been proposed[6]. Six typical frequency bands under 50 Hz (i.e., delta, 0.4–4 Hz; theta, 4–10 Hz; alpha, 11–16 Hz; beta, 16–25 Hz; low gamma, 25–35 Hz; and mid gamma, 35–50 Hz) of the speech envelope were examined as potential frequency bands of primary and secondary band oscillations. These frequency bands were chosen because they correspond closely to the timescales of various units in speech[28,29,30] (e.g., sub-phonemic, phonemic, and syllabic) and are also extensively observed in the brain's cognitive processes, including speech comprehension in the auditory cortex[22,23,24]. We quantitatively compared the amount of information extracted by the proposed NVFS scheme with that of the conventional FFSR scheme, and compared the effectiveness of each segmentation scheme in a speech recognition test.
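To make the link between these bands and speech timescales concrete, the short sketch below converts each band's edges into the duration of one oscillation cycle (period = 1 / frequency). The band edges are the ones listed above; the dictionary and function names are hypothetical, and the sketch does not assign any band to a particular speech unit.

```python
# Band edges in Hz, as listed in the summary above.
BANDS_HZ = {
    "delta": (0.4, 4.0),
    "theta": (4.0, 10.0),
    "alpha": (11.0, 16.0),
    "beta": (16.0, 25.0),
    "low gamma": (25.0, 35.0),
    "mid gamma": (35.0, 50.0),
}

def period_range_ms(low_hz, high_hz):
    """Shortest and longest cycle duration, in milliseconds, for a band."""
    return 1000.0 / high_hz, 1000.0 / low_hz

periods = {name: period_range_ms(*edges) for name, edges in BANDS_HZ.items()}

for name, (shortest, longest) in periods.items():
    print(f"{name:>9}: {shortest:6.1f} to {longest:6.1f} ms per cycle")
```

The cycle durations range from 20 ms at the top of the mid gamma band to 2.5 s at the bottom of the delta band, which is why a single fixed frame size cannot match all of these timescales at once.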