Abstract

Inspired by the success of deep learning in image classification and speech recognition, deep learning algorithms have been explored for music source separation. Solving this problem would enable a wide range of applications such as automatic transcription and audio post-production. Most algorithms use the Short-Time Fourier Transform (STFT) as the time-frequency (T-F) input representation, but each deep learning model configures the STFT differently; there is no standard set of STFT parameters for music source separation. This paper explores different STFT parameters and investigates an alternative representation, the Constant-Q Transform (CQT), for separating three individual sound sources. Experimental results show that dilated convolutional layers work well with the STFT, while standard convolutional layers work well with the CQT. The best-performing combination for music source separation is the STFT with dilated CNNs and a soft-masking method. Researchers should therefore consider the parameters of the T-F representation carefully to improve the performance of their deep learning models.
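
To make the comparison concrete, the sketch below computes both T-F representations and applies a soft mask using librosa. The specific parameter values (n_fft, hop_length, bin counts) are illustrative assumptions, not the paper's configuration, and the HPSS components stand in for a network's source estimates.

```python
# Illustrative sketch: the two T-F representations compared in the paper,
# plus soft masking. Parameter values here are assumptions for demonstration.
import numpy as np
import librosa

# Load a short example clip bundled with librosa (stand-in for a music mixture).
y, sr = librosa.load(librosa.example('trumpet'))

# STFT representation: window and hop sizes are the tunable parameters
# whose effect on separation quality the paper investigates.
S = librosa.stft(y, n_fft=2048, hop_length=512)

# CQT representation: log-spaced frequency bins with a constant Q factor.
C = librosa.cqt(y, sr=sr, hop_length=512, n_bins=84, bins_per_octave=12)

# Soft masking: given magnitude estimates of two sources (here, HPSS
# harmonic/percussive components as placeholders for model outputs),
# build a ratio mask in [0, 1] and apply it to the complex mixture.
H, P = librosa.decompose.hpss(S)
mask = np.abs(H) / (np.abs(H) + np.abs(P) + 1e-10)
S_source = mask * S
y_source = librosa.istft(S_source, hop_length=512)  # back to the time domain
```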
