Abstract
We present a novel non-iterative and rigorously motivated approach for estimating hidden Markov models (HMMs) and factorial hidden Markov models (FHMMs) of high-dimensional signals. Our approach utilizes the asymptotic properties of a spectral, graph-based approach for dimensionality reduction and manifold learning, namely the diffusion framework. We exemplify our approach by applying it to the problem of single-microphone speech separation, where the log-spectra of two unmixed speakers are modeled as HMMs, while their mixture is modeled as an FHMM. We derive two diffusion-based FHMM estimation schemes. The first is experimentally shown to provide separation results comparable to those of contemporary HMM-based speech separation approaches; the second reduces the computational burden.
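As background for the FHMM observation model of the mixture's log-spectrum, the MIXMAX estimator cited in the highlights below builds on the standard log-max approximation, under which the log-spectrum of a sum of sources is close to the elementwise maximum of the sources' log-spectra. The following minimal sketch checks this numerically; the spectra are synthetic stand-ins, not the paper's data or implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic log-power spectra standing in for two speakers' HMM states.
log_spec_a = rng.normal(loc=-2.0, scale=1.0, size=64)
log_spec_b = rng.normal(loc=-2.0, scale=1.0, size=64)

# Exact log-spectrum of the summed powers vs. its log-max approximation:
# log(e^a + e^b) = max(a, b) + log(1 + e^{-|a-b|}) <= max(a, b) + log(2).
exact = np.logaddexp(log_spec_a, log_spec_b)
approx = np.maximum(log_spec_a, log_spec_b)
print("max approximation error (nats):", np.max(np.abs(exact - approx)))
```

The printed error is bounded by log 2, which is why the elementwise maximum is a serviceable observation model in the log-spectral domain.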
Highlights
Single-channel speech separation (SCSS) is one of the most challenging tasks in speech processing, where the aim is to unmix two or more concurrently speaking subjects whose audio mixture is acquired by a single microphone.
The proposed hybrid FHMM (HFHMM) and dual FHMM (DFHMM) schemes were experimentally verified on common state-of-the-art speech separation tasks.
The proposed schemes are compared to the separation scheme proposed by Roweis [36], the iterative FHMM-based estimator by Hu and Wang [38], and the MIXMAX estimator by Radfar and Dansereau [35].
Summary
Single-channel speech separation (SCSS) is one of the most challenging tasks in speech processing, where the aim is to unmix two or more concurrently speaking subjects whose audio mixture is acquired by a single microphone. SCSS has been studied by several schools of thought, among which computational auditory scene analysis (CASA) proved to be one of the most effective. CASA-based methods are motivated by the ability of the human auditory system to separate acoustic events even when using a single ear (although binaural hearing is advantageous). CASA techniques imitate the human auditory filtering known as cochlear filtering: time-frequency bins of the speech mixture are clustered using psychoacoustic cues such as the pitch period, temporal continuity, onsets and offsets, etc. The clustering associates each time-frequency bin with a particular source. The time-frequency bins associated with the desired source are retained, while those associated with the interfering sources are suppressed.
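To make the retain/suppress step concrete, here is a minimal sketch of binary time-frequency masking. It uses an oracle mask computed from the clean sources rather than the psychoacoustic cues a real CASA system would infer, and all signals and parameters are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 8000
t = np.arange(fs) / fs                          # one second of audio
desired = np.sin(2 * np.pi * 440 * t)           # stand-in for speaker 1
interference = np.sin(2 * np.pi * 1300 * t)     # stand-in for speaker 2
mixture = desired + interference

# Short-time Fourier transforms of the sources and of their mixture.
_, _, D = stft(desired, fs=fs, nperseg=256)
_, _, I = stft(interference, fs=fs, nperseg=256)
_, _, M = stft(mixture, fs=fs, nperseg=256)

# Binary mask: a time-frequency bin is kept iff the desired source
# dominates it; bins associated with the interference are zeroed.
mask = np.abs(D) > np.abs(I)

# Retain the masked bins of the mixture and resynthesize the waveform.
_, estimate = istft(mask * M, fs=fs, nperseg=256)
```

Real CASA systems replace the oracle comparison with clustering driven by pitch, continuity, and onset/offset cues, but the masking and resynthesis steps are as above.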