Abstract

Current ASR uses two main sources of information: top-down prior knowledge in the form of a statistical language model, and bottom-up information from the acoustic speech signal. The information from the acoustic signal is represented by so-called speech attributes derived from the signal. Any information lost during attribute extraction cannot be recovered, and any preserved information that is irrelevant to the ASR task can complicate further processing. Attributes in conventional ASR are derived by some transformation of the short-term spectrum of speech, which represents the magnitude frequency components of a short segment of the signal. The origins of this representation can be traced to speech vocoding. For the past decade, we have been working on attributes derived from the posterior probabilities of speech sounds, estimated by a multilayer perceptron trained on large amounts of labeled data. This speech representation is now used in most leading state-of-the-art experimental systems. The talk will describe several research issues that influence the effectiveness of these posterior-based attributes, including the choice of classes of speech sounds, the input evidence for posterior estimation, the architecture and training of the posterior probability estimator, and the post-processing of posterior estimates for use with conventional GMM-HMM ASR.
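
The last of these issues, post-processing posteriors for a conventional GMM-HMM back end, is commonly realized as the "tandem" recipe: log compression of the posterior vectors followed by PCA decorrelation. The sketch below illustrates that recipe under stated assumptions only (NumPy, an illustrative posterior floor, class count, and output dimensionality); it is not code from the talk.

    import numpy as np

    def tandem_features(posteriors, n_components=None):
        """Turn per-frame posterior estimates into GMM-HMM-friendly features.

        posteriors: (n_frames, n_classes) array whose rows sum to 1.
        The log + PCA steps follow the common "tandem" recipe; the exact
        choices (floor value, retained dimensions) are illustrative.
        """
        # Log compression makes the highly skewed posteriors more
        # Gaussian-like; the floor avoids log(0).
        log_post = np.log(np.maximum(posteriors, 1e-10))

        # PCA decorrelation, so that diagonal-covariance Gaussians in
        # the GMM-HMM can model the features reasonably well.
        centered = log_post - log_post.mean(axis=0)
        eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
        order = np.argsort(eigvals)[::-1]  # sort by descending variance
        if n_components is not None:
            order = order[:n_components]
        return centered @ eigvecs[:, order]

    # Hypothetical usage: 100 frames of posteriors over 40 phone-like
    # classes, reduced to 25 decorrelated dimensions.
    post = np.random.dirichlet(np.ones(40), size=100)
    feats = tandem_features(post, n_components=25)
    print(feats.shape)  # (100, 25)

The two steps matter because GMM-HMM systems typically use diagonal covariances, which fit raw posterior vectors poorly: the values are heavily skewed and strongly correlated across classes.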
