Abstract
Monaural speech segregation remains a computational challenge for auditory scene analysis (ASA). A major problem for existing computational auditory scene analysis (CASA) systems is their inability to deal with signals in the high-frequency range. Psychoacoustic evidence suggests that different perceptual mechanisms are involved in handling resolved and unresolved harmonics. We propose a system for speech segregation that treats low-frequency and high-frequency signals differently. For low-frequency signals, our model generates segments based on temporal continuity and cross-channel correlation, and groups them according to periodicity. For high-frequency signals, the model generates segments based on common amplitude modulation (AM) in addition to temporal continuity, and groups them according to AM repetition rates. Underlying the grouping process is a pitch contour that is first estimated from segregated speech based on global pitch and then verified by psychoacoustic constraints. Our system is systematically evaluated, and it yields substantially better performance than previous CASA systems, especially in the high-frequency range.
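To make the low-frequency grouping cue concrete, here is a minimal sketch of cross-channel correlation: adjacent filter channels whose periodicity patterns correlate strongly are candidates for merging into one segment. Everything below is illustrative and assumed, not the paper's implementation: the function names, the lag window, and the toy signals are ours, and the actual system operates on gammatone filterbank responses within a full correlogram framework.

import numpy as np

def normalized_autocorrelation(response, max_lag=200):
    # Autocorrelation of one channel's response, mean-removed and
    # unit-normalized so comparisons across channels are scale-invariant.
    acf = np.correlate(response, response, mode="full")
    acf = acf[len(response) - 1 : len(response) - 1 + max_lag]
    acf = acf - acf.mean()
    return acf / (np.linalg.norm(acf) + 1e-12)

def cross_channel_correlation(resp_a, resp_b, max_lag=200):
    # High correlation between two channels' periodicity patterns
    # suggests a common source, supporting their grouping into a segment.
    a = normalized_autocorrelation(resp_a, max_lag)
    b = normalized_autocorrelation(resp_b, max_lag)
    return float(np.dot(a, b))

# Toy check (hypothetical values): two channels excited by the same
# 100 Hz periodicity, sampled at 16 kHz, should correlate strongly
# even with a phase offset and some noise, because autocorrelation
# is insensitive to phase.
fs, f0 = 16000, 100.0
t = np.arange(4096) / fs
ch1 = np.sin(2 * np.pi * f0 * t)
ch2 = np.sin(2 * np.pi * f0 * t + 0.3) + 0.1 * np.random.randn(t.size)
print(cross_channel_correlation(ch1, ch2))  # close to 1.0

For high-frequency (unresolved) channels, the analogous comparison would be made on envelope (AM) patterns rather than fine-structure periodicity, with the AM repetition rate checked against the estimated pitch contour; the sketch above shows only the low-frequency case.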