Abstract
Low-level audio features are commonly used in many audio analysis tasks, such as audio scene classification or acoustic event detection. Due to the variable length of audio signals, a common approach is to create fixed-length feature vectors consisting of a set of statistics that summarize the temporal variability of such short-term features. To avoid the loss of temporal information, the audio event can be divided into a set of mid-term segments or texture windows. However, such an approach requires an accurate estimate of the onset and offset times of the audio events in order to obtain a robust mid-term statistical description of their temporal evolution. This paper proposes the use of an alternative event representation based on nonlinear time normalization prior to the extraction of mid-term statistics. The short-term features are transformed into a new fixed-length representation obtained by uniform distance subsampling over a defined feature space, in contrast to classical short-term temporal framing. The results show that the use of distance-based texture windows provides an improved statistical description of the event that is robust to errors in the event segmentation stage under noisy conditions.
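The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of distance-based resampling: the short-term feature trajectory is resampled at points uniformly spaced in cumulative feature-space (here, Euclidean) distance rather than in time, and mid-term statistics are then computed over texture windows of the normalized trajectory. The function names, the choice of Euclidean distance, the mean/standard-deviation statistics, and all parameter values are illustrative assumptions, not the authors' exact method.

```python
import numpy as np

def distance_resample(features: np.ndarray, num_points: int) -> np.ndarray:
    """Resample a short-term feature trajectory at positions uniformly spaced
    in cumulative feature-space (Euclidean) distance rather than in time.

    features: array of shape (num_frames, num_dims), one row per short-term frame.
    num_points: length of the fixed-size output trajectory.
    Returns an array of shape (num_points, num_dims).
    """
    # Cumulative Euclidean arc length along the feature trajectory.
    step = np.linalg.norm(np.diff(features, axis=0), axis=1)
    arc = np.concatenate(([0.0], np.cumsum(step)))
    if arc[-1] == 0.0:                      # degenerate (constant) trajectory
        return np.repeat(features[:1], num_points, axis=0)
    # Target positions uniformly spaced along the total arc length.
    targets = np.linspace(0.0, arc[-1], num_points)
    # Interpolate each feature dimension at the target arc-length positions.
    return np.stack([np.interp(targets, arc, features[:, d])
                     for d in range(features.shape[1])], axis=1)

def texture_window_stats(features: np.ndarray, num_windows: int) -> np.ndarray:
    """Split the length-normalized trajectory into contiguous texture windows
    and summarize each window with its mean and standard deviation."""
    windows = np.array_split(features, num_windows, axis=0)
    stats = [np.concatenate([w.mean(axis=0), w.std(axis=0)]) for w in windows]
    return np.concatenate(stats)            # fixed-length mid-term descriptor

# Example: 13 MFCC-like coefficients over a variable number of frames.
rng = np.random.default_rng(0)
short_term = rng.standard_normal((237, 13))            # variable-length event
normalized = distance_resample(short_term, num_points=100)
descriptor = texture_window_stats(normalized, num_windows=4)
print(descriptor.shape)                                 # (104,) = 4 windows * 2 stats * 13 dims
```

Because the resampling is driven by how far the features move rather than by elapsed time, frames where the signal is nearly stationary (e.g., silence padded around an imprecisely segmented event) contribute little to the normalized trajectory, which is consistent with the robustness to segmentation errors claimed in the abstract.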