Abstract
In recent years, there has been growing research interest in generating spatialized audio from a mono audio signal. Existing methods often use neural networks that combine image segmentation with object localization to restore spatial qualities. Instead, our project takes an arbitrary mono input sound and an input angle, and outputs spatial stereo audio of the input sound with the directionality of that angle. This differs from current implementations in that it is a simpler approach to spatial audio generation, and it allows the model to be applied in new areas. Using a binaural microphone and a custom-made anechoic chamber, we recorded 120 hours of labelled binaural audio for use in our model. The audio consists of frequency sweeps, pink noise, and phonetically balanced speech. Our method predicts a complex short-time Fourier transform (STFT) mask that encodes both amplitude and phase. The model is an autoencoder based on the U-Net architecture, which is applied to the mono input before being compared against the labelled data for training. With this simpler approach, we hope to make spatial audio more accessible for a variety of applications.
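To make the masking pipeline concrete, below is a minimal sketch of how a predicted complex STFT mask could be applied to a mono signal to produce directional stereo output. The function `predict_complex_masks`, the per-channel (left/right) masking, and the identity placeholder masks are all assumptions for illustration; in the paper's approach, the mask would come from the U-Net style autoencoder conditioned on the target angle.

```python
import numpy as np
from scipy.signal import stft, istft

# Hypothetical stand-in for the paper's U-Net mask predictor: it is
# assumed here to return one complex mask per output channel, and the
# placeholder identity masks simply pass the input through unchanged.
def predict_complex_masks(complex_spec, angle_deg):
    left = np.ones_like(complex_spec)   # placeholder prediction
    right = np.ones_like(complex_spec)  # placeholder prediction
    return left, right

fs = 44_100
mono = np.random.randn(fs)  # 1 s of stand-in mono audio
angle = 45.0                # desired source direction in degrees

# Complex STFT of the mono input; the complex values carry both
# amplitude and phase, which the predicted masks reshape.
_, _, Z = stft(mono, fs=fs, nperseg=1024)

mask_l, mask_r = predict_complex_masks(Z, angle)

# Element-wise complex multiplication applies each mask, and the
# inverse STFT resynthesizes the signal for each ear.
_, left = istft(Z * mask_l, fs=fs, nperseg=1024)
_, right = istft(Z * mask_r, fs=fs, nperseg=1024)

binaural = np.stack([left, right], axis=0)  # (2, num_samples) stereo
```

Because the mask is complex-valued, a single element-wise multiplication can adjust both the interaural level and phase differences that give the output its directionality.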