Abstract

The purpose of this paper is to compare the performance of human listeners against selected machine learning algorithms in the task of classifying spatial audio scenes in binaural recordings of music under practical conditions. Three scenes were subject to classification: (1) a music ensemble (a group of musical sources) located in the front, (2) a music ensemble located at the back, and (3) a music ensemble distributed around a listener. In the listening test, undertaken remotely over the Internet, human listeners reached a classification accuracy of 42.5%. For the listeners who passed the post-screening test, the accuracy was greater, approaching 60%. The above classification task was also undertaken automatically using four machine learning algorithms: a convolutional neural network, support vector machines, an extreme gradient boosting framework, and logistic regression. The machine learning algorithms substantially outperformed the human listeners, with the classification accuracy reaching 84% when tested under binaural-room-impulse-response (BRIR) matched conditions. However, when the algorithms were tested under the BRIR mismatched scenario, their accuracy was comparable to that exhibited by the listeners who passed the post-screening test, implying that the machine learning algorithms' capability to perform in unknown electro-acoustic conditions needs to be further improved.

Highlights

  • Following its success in virtual-reality applications [1,2,3], binaural technology is being gradually adopted by professional and amateur broadcasters [4,5], with a prospect of becoming a prominent “tool” for delivering 3D audio content over the Internet

  • The outcomes of such research could help to design systems for the semantic search and retrieval of audio content in binaural recordings based on spatial information, allowing listeners to explore Internet resources looking for recordings with a music ensemble located at particular directions, e.g., behind the head of a listener

  • We considered a case whereby, for a given stimulus, the scene designated by a listener matched the intended scene during its binaural synthesis


Introduction

Following its success in virtual-reality applications [1,2,3], binaural technology is being gradually adopted by professional and amateur broadcasters [4,5], with the prospect of becoming a prominent “tool” for delivering 3D audio content over the Internet. For humans and machines alike, spatial perception of audio sources in the horizontal plane is facilitated by such binaural cues as the interaural time difference (ITD), interaural level difference (ILD), and interaural coherence (IC) [2].

Background–Foreground (BF) scene: background content in front of a listener, with foreground content perceived from the back (a reversed stage-audio scenario). While this scene is infrequently used in music recordings, it was included in this study for completeness, as a symmetrically “flipped” counterpart.

Foreground–Foreground (FF) scene: foreground content both in front of and behind a listener, surrounding the listener in the horizontal plane. This scene is often used in binaural music recordings, e.g., in electronica, dance, and pop music (360° source scenario [33]).

The remaining 136 listeners carried out their listening tests in unknown and uncontrolled environments. They were requested to report the manufacturer and the model of the headphones used. Some models of the headphones employed an active noise reduction system, potentially degrading the faithfulness of spatial audio reproduction.
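As a minimal illustration of the binaural cues named above (not the feature-extraction pipeline used in the study), the following Python sketch estimates ITD via the lag of the cross-correlation peak, ILD as an RMS ratio in decibels, and IC as the maximum of the normalized cross-correlation; the synthetic delayed-and-attenuated signal is our own assumption for testing purposes.

```python
import numpy as np

def binaural_cues(left, right, fs):
    """Estimate ITD (s), ILD (dB), and IC for one binaural frame."""
    # Interaural time difference: lag of the cross-correlation peak.
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)
    itd = -lag / fs  # positive when the sound reaches the left ear first

    # Interaural level difference: left/right RMS ratio in decibels.
    rms_l = np.sqrt(np.mean(left ** 2))
    rms_r = np.sqrt(np.mean(right ** 2))
    ild = 20 * np.log10(rms_l / rms_r)

    # Interaural coherence: peak of the normalized cross-correlation.
    ic = np.max(corr) / np.sqrt(np.sum(left ** 2) * np.sum(right ** 2))
    return itd, ild, ic

# Synthetic check: right channel delayed by 10 samples and 6 dB quieter,
# mimicking a source on the listener's left.
fs = 48000
rng = np.random.default_rng(0)
sig = rng.standard_normal(fs // 10)
left = sig
right = np.roll(sig, 10) * 0.5
itd, ild, ic = binaural_cues(left, right, fs)
```

In practice such cues are computed per frequency band (e.g., on a gammatone filterbank) rather than broadband, but the per-frame estimators are the same.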

