Abstract

We present a novel non-iterative and rigorously motivated approach for estimating hidden Markov models (HMMs) and factorial hidden Markov models (FHMMs) of high-dimensional signals. Our approach utilizes the asymptotic properties of a spectral, graph-based approach for dimensionality reduction and manifold learning, namely the diffusion framework. We exemplify our approach by applying it to the problem of single-microphone speech separation, where the log-spectra of two unmixed speakers are modeled as HMMs, while their mixture is modeled as an FHMM. We derive two diffusion-based FHMM estimation schemes. The first is experimentally shown to provide separation results comparable with those of contemporary HMM-based speech separation approaches; the second reduces the computational burden.

Highlights

  • Single-channel speech separation (SCSS) is one of the most challenging tasks in speech processing, where the aim is to unmix two or more concurrently speaking subjects, whose audio mixture is acquired by a single microphone

  • The proposed hybrid FHMM (HFHMM) and dual FHMM (DFHMM) schemes were experimentally verified on common state-of-the-art speech separation tasks

  • The proposed schemes are compared to the separation scheme of Roweis [36], the iterative FHMM-based estimator of Hu and Wang [38], and the MIXMAX estimator of Radfar and Dansereau [35]

Introduction

Single-channel speech separation (SCSS) is one of the most challenging tasks in speech processing, where the aim is to unmix two or more concurrently speaking subjects whose audio mixture is acquired by a single microphone. Single-channel speech separation has been studied by several schools of thought, among which computational auditory scene analysis (CASA) proved to be one of the most effective. CASA-based methods are motivated by the ability of the human auditory system to separate acoustic events even when using a single ear (though binaural hearing is advantageous). CASA techniques imitate the human auditory filtering known as cochlear filtering, where time-frequency bins of the speech mixture are clustered using psychoacoustic cues such as the pitch period, temporal continuity, onsets and offsets, etc. The clustering associates each time-frequency bin with a particular source. The time-frequency bins associated with the desired source are retained, while those associated with interfering sources are discarded.
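The CASA-style retain/discard step described above amounts to applying a binary mask in the time-frequency domain. The following is a minimal illustrative sketch of such masking (not the paper's diffusion-based FHMM scheme): it assumes oracle access to the unmixed references, which here stand in for the psychoacoustic clustering cues, and the function name `ideal_binary_mask_separation` is hypothetical.

```python
import numpy as np
from scipy.signal import stft, istft

def ideal_binary_mask_separation(mixture, ref_a, ref_b, fs=8000, nperseg=256):
    """Separate a two-speaker mixture with an ideal binary mask (IBM).

    Each time-frequency bin of the mixture is assigned to whichever
    source dominates it; the bins of the desired source are retained
    and the rest are zeroed out, as in CASA-style masking.
    """
    _, _, Z = stft(mixture, fs=fs, nperseg=nperseg)
    _, _, A = stft(ref_a, fs=fs, nperseg=nperseg)
    _, _, B = stft(ref_b, fs=fs, nperseg=nperseg)
    mask = np.abs(A) >= np.abs(B)  # True where source A dominates the bin
    _, est_a = istft(Z * mask, fs=fs, nperseg=nperseg)
    _, est_b = istft(Z * ~mask, fs=fs, nperseg=nperseg)
    return est_a, est_b

# Toy demo: two "speakers" occupying disjoint frequency bands.
fs = 8000
t = np.arange(fs) / fs
s1 = np.sin(2 * np.pi * 440 * t)
s2 = np.sin(2 * np.pi * 1200 * t)
est1, est2 = ideal_binary_mask_separation(s1 + s2, s1, s2, fs=fs)
```

In practice the mask must be inferred from the mixture alone (e.g. via pitch tracking or, as in this work, statistical source models), which is what makes SCSS hard; the oracle mask above only illustrates the masking mechanism itself.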
