Abstract
Speech segregation from a monaural recording is a primary task of auditory scene analysis, and it has proven to be very challenging. We present a multistage model for the task. The model starts with a simulated auditory periphery. A subsequent stage computes mid-level auditory representations, including correlograms and cross-channel correlations. The core of the system performs segmentation and grouping in a two-dimensional time-frequency representation that encodes proximity in frequency and time, periodicity, and amplitude modulation (AM). Motivated by psychoacoustic observations, our system employs different mechanisms for handling resolved and unresolved harmonics. For resolved harmonics, the system generates segments (basic components of an auditory scene) based on temporal continuity and cross-channel correlation, and groups them according to periodicity. For unresolved harmonics, it generates segments based on AM in addition to temporal continuity and groups them according to AM repetition rates. We derive AM repetition rates using sinusoidal modeling and gradient descent. Underlying the segregation process is a pitch contour that is first estimated from speech segregated according to global pitch and then adjusted according to psychoacoustic constraints. The model has been systematically evaluated, and it yields substantially better performance than previous systems.
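As a rough illustration of the mid-level stage, the sketch below computes one correlogram frame (the normalized autocorrelation of each cochlear-filter channel) and the correlation between adjacent channels' autocorrelation functions. Function names, array shapes, and the normalization are illustrative assumptions, not the paper's exact formulation.

    import numpy as np

    def correlogram_frame(channels, max_lag):
        """One correlogram frame: normalized autocorrelation of each
        cochlear-filter channel over the current time frame.
        channels: (num_channels, frame_len) array of filtered responses
        (shapes are assumptions, not the paper's API).
        Returns: (num_channels, max_lag) array with value 1 at lag 0."""
        num_channels, frame_len = channels.shape
        acf = np.zeros((num_channels, max_lag))
        for c in range(num_channels):
            x = channels[c]
            energy = np.dot(x, x) + 1e-12  # guard against silent channels
            for lag in range(max_lag):
                acf[c, lag] = np.dot(x[:frame_len - lag], x[lag:]) / energy
        return acf

    def cross_channel_correlation(acf):
        """Pearson correlation between autocorrelation functions of
        adjacent channels; high values suggest both channels respond to
        the same (resolved) harmonic, a cue for forming segments."""
        num_channels = acf.shape[0]
        corr = np.zeros(num_channels - 1)
        for c in range(num_channels - 1):
            a = acf[c] - acf[c].mean()
            b = acf[c + 1] - acf[c + 1].mean()
            corr[c] = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        return corr

The derivation of AM repetition rates via sinusoidal modeling and gradient descent might, in spirit, look like the following: fit a single sinusoid to a channel envelope, solving for the two amplitudes in closed form and refining the modulation frequency by gradient descent. The name estimate_am_rate, the initializer f_init, and the step size lr are hypothetical; the paper's actual objective and update rule may differ.

    def estimate_am_rate(envelope, fs, f_init, iters=100, lr=1e-3):
        """Fit a sinusoid to a channel's AM envelope and refine its
        frequency by gradient descent (a sketch, not the paper's method).
        envelope: 1-D AM envelope of one filter channel over a frame.
        fs: sampling rate of the envelope in Hz.
        f_init: starting guess, e.g. a global pitch estimate."""
        t = np.arange(len(envelope)) / fs
        e = envelope - envelope.mean()        # remove DC before fitting
        f = float(f_init)
        for _ in range(iters):
            c = np.cos(2 * np.pi * f * t)
            s = np.sin(2 * np.pi * f * t)
            # Closed-form least squares for the two amplitudes at fixed f.
            a, b = np.linalg.lstsq(np.column_stack((c, s)), e, rcond=None)[0]
            r = e - (a * c + b * s)           # residual of the fit
            # Gradient of 0.5*||r||^2 w.r.t. f, holding a and b fixed;
            # lr may need rescaling for the envelope's amplitude and length.
            grad = np.dot(r * 2 * np.pi * t, a * s - b * c)
            f -= lr * grad
        return f

In this sketch, gradient descent adjusts only the modulation frequency; a channel whose fitted rate agrees with the repetition rate implied by the estimated pitch contour would presumably be grouped with the target stream.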