Abstract

A machine simulation of human auditory perception must be able to recognize and classify individual sound sources. The most successful technique for sound classification is the statistical pattern recognition approach employed in speech recognizers; however, in most practical cases, this approach assumes that the entire (monaural) signal represents the source to be classified. Realistic "cocktail-party" scenes, composed of multiple, overlapping sources with comparable energies, do not come close to meeting this assumption. A more workable assumption is to treat each time-frequency cell as representing a single source, and to use missing-data techniques to perform recognition using only a subset of the cells. This precludes the use of cepstral features (which depend on every frequency component), but is otherwise practical. The problem then becomes finding the "present data mask" that indicates which cells are to be considered during classification of a particular source. We will present a system based on these principles, with applications both to speech recognition in dynamic, noisy backgrounds, and also to nonspeech sounds such as alarms that can occur at very poor signal-to-noise ratios.
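The missing-data idea described above can be sketched as follows. This is an illustrative toy, not the authors' system: assuming diagonal-covariance Gaussian class models over spectral features (a common choice in missing-data recognition), cells marked absent in the present-data mask are marginalized out of the likelihood, which for a diagonal Gaussian amounts to dropping their terms. The class names and values here are hypothetical.

```python
import numpy as np

def masked_log_likelihood(x, mask, mean, var):
    """Log-likelihood of spectral frame x under a diagonal Gaussian,
    computed over only the time-frequency cells flagged as present.
    Missing cells are marginalized out, which for a diagonal
    covariance simply means omitting their terms from the sum."""
    present = mask.astype(bool)
    d = x[present] - mean[present]
    v = var[present]
    return -0.5 * np.sum(np.log(2 * np.pi * v) + d * d / v)

# Hypothetical two-class example: choose the class whose model best
# explains only the reliable cells of a corrupted frame.
rng = np.random.default_rng(0)
means = {"speech": np.ones(8), "alarm": -np.ones(8)}
var = np.full(8, 0.5)
frame = np.ones(8) + 0.1 * rng.standard_normal(8)
frame[4:] = 10.0                            # cells dominated by an interferer
mask = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # present-data mask

scores = {c: masked_log_likelihood(frame, mask, m, var)
          for c, m in means.items()}
best = max(scores, key=scores.get)
print(best)
```

Because the upper four cells are excluded by the mask, the interfering energy there never enters the comparison, and the frame is classified from its reliable cells alone.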
