Abstract
The aim of the study was to develop a method for the automatic classification of three spatial audio scenes, differing in the horizontal distribution of foreground and background audio content around a listener in binaurally rendered recordings of music. For the purpose of the study, audio recordings were synthesized using thirteen sets of binaural room impulse responses (BRIRs), representing the room acoustics of both semi-anechoic and reverberant venues. Head movements were not considered in the study. The proposed method was assumption-free with regard to the number and characteristics of the audio sources. A least absolute shrinkage and selection operator (LASSO) was employed as a classifier. According to the results, it is possible to automatically identify the spatial scenes using a combination of binaural and spectro-temporal features. The method exhibits satisfactory classification accuracy when it is trained and then tested on different stimuli synthesized using the same BRIRs (accuracy ranging from 74% to 98%), even in highly reverberant conditions. However, the generalizability of the method needs to be further improved. This study demonstrates that, in addition to the binaural cues, the Mel-frequency cepstral coefficients constitute an important carrier of spatial information, imperative for the classification of spatial audio scenes.
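The binaural cues mentioned above are conventionally summarized by the interaural level difference (ILD) and interaural time difference (ITD). As a minimal sketch of how such cues can be estimated from a two-channel signal (this is an illustrative stand-in, not the paper's feature-extraction pipeline; the function name and the broadband cross-correlation approach are assumptions):

```python
import numpy as np

def binaural_cues(left, right, fs):
    """Estimate broadband ILD (dB) and ITD (s) from a binaural signal pair.

    Illustrative sketch: ILD is taken as the RMS level difference between
    the two ears; ITD as the lag of the peak of the full cross-correlation.
    """
    rms_l = np.sqrt(np.mean(left ** 2))
    rms_r = np.sqrt(np.mean(right ** 2))
    ild_db = 20.0 * np.log10(rms_l / rms_r)   # positive: left ear louder

    xcorr = np.correlate(right, left, mode="full")
    lag = np.argmax(xcorr) - (len(left) - 1)  # positive: left ear leads
    itd_s = lag / fs
    return ild_db, itd_s

# Toy example: a noise source attenuated and delayed at the right ear,
# as would occur for a source on the listener's left.
rng = np.random.default_rng(1)
src = rng.standard_normal(4800)
left = src
right = 0.5 * np.concatenate([np.zeros(10), src[:-10]])
ild, itd = binaural_cues(left, right, fs=48000)
```

In practice such cues are computed per frequency band (e.g., on a gammatone filterbank) rather than broadband, but the broadband version suffices to show the principle.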
Highlights
This study builds on the work on the spatial audio scene characterization of five-channel surround sound recordings undertaken by Zieliński [23,24].
The aim of the first experiment was to check how the method performed when trained and tested on the excerpts synthesized using the same sets of BRIRs.
The exception was the model obtained for the BRIR set No. 7, for which the accuracy attained for the spectral features and the Root Mean Square (RMS)-based metrics was equal to 75.8% and 69.2%, respectively.
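The LASSO classifier named in the abstract combines prediction with built-in feature selection: the L1 penalty shrinks the coefficients of uninformative features exactly to zero. A minimal, self-contained sketch of this shrinkage-and-selection behaviour, using coordinate descent on synthetic data (this is a generic LASSO implementation for illustration, not the authors' model or data):

```python
import numpy as np

def soft_threshold(z, g):
    """Soft-thresholding operator, the proximal map of the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Fit LASSO, minimizing (1/2n)||y - Xb||^2 + lam*||b||_1,
    by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with feature j removed from the fit
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r / n
            denom = X[:, j] @ X[:, j] / n
            beta[j] = soft_threshold(rho, lam) / denom
    return beta

# Synthetic data: 2 informative features, 4 pure-noise features.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 6))
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(200)
beta_hat = lasso_cd(X, y, lam=0.1)
```

The fitted coefficients for the four noise features come out at (or very near) zero, which is the selection property that makes LASSO attractive when the discriminative power of individual audio features is not known in advance.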
Summary
Binaural audio technology is rapidly gaining popularity. For example, it is widely used for the rendering of 360° virtual reality content in one of the most popular video-sharing Internet services [1]. Multiple-source localization models have been developed [6,7,8,9], which constitutes an important step towards the quantification of higher-level attributes (e.g., ensemble width), leading to a holistic characterization of complex spatial audio scenes. While their reported accuracy is deemed to be good, their applicability is limited, as they often require a priori knowledge about the number of sources of interest and their signal characteristics. The above considerations underlay the work described in this paper.