Dealing with unknown unknowns in speech

Hynek Hermansky

doi:10.1121/1.3654655

Abstract

A common belief in the speech recognition community is that the most significant improvements in machine performance come from more training data. Implicit in this belief is the assumption that the speech to be recognized comes from the same distribution as the speech on which the machine was trained. Problems occur when this assumption is violated. Words that are not in the machine's lexicon, unexpected signal distortions and noises, unknown accents, and other speech peculiarities all create problems for current automatic speech recognition (ASR). The problem is inherent to machine learning and will not go away unless alternatives to the extensive reliance on the false belief in an unchanging world are found. In automatic speech recognition, words that are not in the machine's expected lexicon are typically replaced by acoustically similar but nevertheless wrong words. Similarly, unexpected noise is typically ignored in human speech communication but causes significant problems for a machine. We discuss a biologically inspired multistream architecture for a speech recognition machine that could alleviate some of the problems with unexpected acoustic inputs. Some published experimental results are given.
