Abstract

This paper presents a methodology that incorporates temporal feature integration for automated generalized sound recognition. Such a system can be of great use for scene analysis and understanding based on the acoustic modality. The performance of three feature sets, based on the Mel filterbank, the MPEG-7 audio protocol, and wavelet decomposition, is assessed. Furthermore, we explore temporal integration using three different strategies: (a) short-term statistics, (b) spectral moments, and (c) autoregressive models. The experimental setup, which is based on the concurrent usage of professional sound-effects collections, is explained thoroughly; in this way we aim to form a representative picture of the characteristics of ten sound classes. Audio classification is carried out with statistical models (HMMs), and a fusion scheme that exploits the models constructed from the various feature sets provides the highest average recognition rate. The proposed system not only uses diverse groups of sound parameters but also exploits the advantages of temporal feature integration.
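
To make the temporal integration strategies concrete, the sketch below shows how frame-level features might be collapsed into texture-window vectors via short-term statistics (strategy (a)) and how a simple autoregressive model (strategy (c)) can be fitted to a single feature trajectory. The window length, hop size, and AR order are illustrative assumptions, not values taken from the paper.

```python
# A minimal sketch of temporal feature integration, assuming frame-level
# features arrive as a (n_frames x n_dims) array (e.g., MFCCs per frame).
import numpy as np

def integrate_statistics(frames, win=40, hop=20):
    """Strategy (a): collapse frame-level features into texture-window
    vectors of per-dimension means and standard deviations."""
    out = []
    for start in range(0, len(frames) - win + 1, hop):
        seg = frames[start:start + win]
        out.append(np.concatenate([seg.mean(axis=0), seg.std(axis=0)]))
    return np.array(out)

def ar_coefficients(x, order=3):
    """Strategy (c): least-squares AR(p) fit to one feature trajectory x[t],
    i.e., x[t] ~ a_1*x[t-1] + ... + a_p*x[t-p]."""
    X = np.column_stack(
        [x[order - k - 1:len(x) - k - 1] for k in range(order)])
    y = x[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs
```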

Highlights

  • Humans have the ability to detect and recognize a sound event quite effortlessly

  • In addition to the MPEG-7 audio standard and the Mel filterbank, we investigate a novel method that employs multiresolution analysis of audio signals using critical-band-based wavelet packets (see the sketch following this list)

  • The results confirm that the MPEG-7 audio protocol provides, for each audio class, a representation that follows a consistent pattern which can be modeled by left-right HMMs and used afterwards to classify novel data (a modeling sketch also follows this list)
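
As referenced above, the following is a minimal sketch of wavelet-packet subband features. It uses the PyWavelets library and substitutes a uniform level-5 decomposition with per-subband log-energies for the paper's critical-band tree, whose exact node selection is not reproduced here; the wavelet family and depth are assumptions.

```python
# Wavelet-packet subband log-energies as audio features (illustrative
# stand-in for the paper's critical-band-based wavelet packet analysis).
import numpy as np
import pywt

def wavelet_packet_energies(signal, wavelet="db4", level=5):
    """Decompose a 1-D signal into 2**level wavelet-packet subbands
    and return the log-energy of each subband as a feature vector."""
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet,
                            mode="symmetric", maxlevel=level)
    nodes = wp.get_level(level, order="freq")  # frequency-ordered leaves
    energies = [np.sum(node.data ** 2) for node in nodes]
    return np.log(np.array(energies) + 1e-12)  # log-compress, avoid log(0)
```

Similarly, the left-right HMM modeling mentioned in the last highlight might look like the sketch below, built with the hmmlearn library (our library choice, not the paper's): one model per sound class, a banded transition matrix that allows only self-loops and single forward steps, and classification by maximum log-likelihood. State count and hyperparameters are assumptions.

```python
# A left-right (Bakis) HMM per sound class; zeros in the transition
# matrix are preserved by Baum-Welch, so the topology survives training.
import numpy as np
from hmmlearn import hmm

def make_left_right_hmm(n_states=4):
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=50, init_params="mc", params="stmc")
    # Each state may only stay put or move one step to the right.
    trans = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        trans[i, i] = trans[i, i + 1] = 0.5
    trans[-1, -1] = 1.0
    model.transmat_ = trans
    model.startprob_ = np.eye(n_states)[0]  # always start in state 0
    return model

# Hypothetical usage: train one model per class on that class's frame
# sequences, then pick the class whose model scores the test clip highest.
# models = {c: make_left_right_hmm().fit(X_c, lengths_c) for c in classes}
# label = max(models, key=lambda c: models[c].score(X_test))
```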



Introduction

Humans have the ability to detect and recognize a sound event quite effortlessly. We can concentrate on a particular sound event, isolating it from background noise; for example, we can focus on a conversation while loud music is playing. Over the last few decades, emphasis has been placed upon methods for automated speech/speaker recognition, since speech plays an important role in both human-human and human-machine interaction. Every sound source exhibits a consistent acoustic pattern, which results in a specific way of distributing its energy over its frequency content. The categorization of sounds into distinct classes is sometimes ambiguous (an audio category may overlap with another), while composite real-world sound scenes can be very difficult to analyze. This fact has led to solutions that target specific problems, while a generic system remains an open research subject.

