Abstract

This paper presents a new method of singing voice analysis that performs mutually dependent singing voice separation and vocal fundamental frequency (F0) estimation. Vocal F0 estimation is expected to become easier if singing voices can be separated from a music audio signal, and vocal F0 contours are in turn useful for singing voice separation. This calls for an approach that improves the performance of each task by using the results of the other. The proposed method first performs robust principal component analysis (RPCA) to roughly extract singing voices from a target music audio signal. The F0 contour of the main melody is then estimated from the separated singing voices by finding the optimal temporal path over an F0 saliency spectrogram. Finally, the singing voices are separated again more accurately by combining a conventional time-frequency mask given by RPCA with another mask that passes only the harmonic structures of the estimated F0s. Experimental results showed that the proposed method significantly improved the performance of both singing voice separation and vocal F0 estimation. The proposed method also outperformed all the other methods of singing voice separation submitted to an international music analysis competition called MIREX 2014.
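The second and third steps of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the linear jump penalty in the path search, the fixed harmonic half-width in Hz, the treatment of non-positive F0 as "no harmonics", and the choice to combine masks with an element-wise AND are all assumptions made for illustration. The input `saliency` stands for any F0 saliency spectrogram of shape (F0 bins, frames), and `rpca_mask` for a binary time-frequency mask already obtained from RPCA.

```python
import numpy as np

def best_f0_path(saliency, jump_penalty=1.0):
    """Dynamic-programming (Viterbi-style) search for the F0-bin path that
    maximizes summed saliency minus a penalty on bin jumps (assumed form)."""
    n_bins, n_frames = saliency.shape
    idx = np.arange(n_bins)
    # trans[prev, cur]: cost of moving from bin `prev` to bin `cur`
    trans = -jump_penalty * np.abs(idx[:, None] - idx[None, :])
    score = saliency[:, 0].astype(float)
    back = np.zeros((n_bins, n_frames), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + trans          # cand[prev, cur]
        back[:, t] = cand.argmax(axis=0)       # best predecessor per bin
        score = cand.max(axis=0) + saliency[:, t]
    path = np.empty(n_frames, dtype=int)
    path[-1] = score.argmax()
    for t in range(n_frames - 1, 0, -1):       # backtrack
        path[t - 1] = back[path[t], t]
    return path

def harmonic_mask(f0_hz, freqs_hz, n_harmonics=20, half_width_hz=35.0):
    """Binary (frequency x time) mask passing bins near harmonics of each
    frame's F0. Frames with f0 <= 0 pass nothing (assumption)."""
    mask = np.zeros((len(freqs_hz), len(f0_hz)), dtype=bool)
    for t, f0 in enumerate(f0_hz):
        if f0 <= 0:
            continue
        harmonics = f0 * np.arange(1, n_harmonics + 1)
        dist = np.abs(freqs_hz[:, None] - harmonics[None, :]).min(axis=1)
        mask[:, t] = dist <= half_width_hz
    return mask

def combine_masks(rpca_mask, harm_mask):
    """Keep only bins passed by both masks (element-wise AND, assumed)."""
    return np.logical_and(rpca_mask, harm_mask)
```

With a larger `jump_penalty`, the path search ignores isolated saliency peaks that would require a large F0 jump, which is the intuition behind tracking a temporally smooth contour rather than picking the per-frame maximum.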

Highlights

  • Singing voice analysis is important for active music listening interfaces [1] that enable a user to customize the contents of existing music recordings in ways not limited to frequency equalization and tempo adjustment

  • The parameters λ and α affect the accuracy of vocal F0 estimation. λ is the sparsity factor of robust principal component analysis (RPCA), and α is the weight parameter for computing the F0-saliency spectrogram described in Section III-B2. α determines the balance between a subharmonic summation (SHS) spectrogram and an F0 enhancement spectrogram in an F0-saliency spectrogram, and there should be a range of values for it that provides robust performance

  • We evaluated the accuracy of singing voice separation for combinations of λ from 0.6 to 1.1 in steps of 0.1 and α from 0 to 2.0 in steps of 0.2
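As a small illustration of the parameter sweep described in the last highlight, the grid of (λ, α) pairs can be enumerated as follows. This is only a sketch of the grid itself: the variable names are hypothetical, and the evaluation that would be run for each pair is not shown because it is not specified here.

```python
from itertools import product

# lambda: 0.6 to 1.1 in steps of 0.1; alpha: 0 to 2.0 in steps of 0.2.
# Integer ranges divided by 10 avoid accumulated floating-point drift.
lambdas = [k / 10 for k in range(6, 12)]
alphas = [k / 10 for k in range(0, 21, 2)]

# Every (lambda, alpha) combination to be evaluated.
grid = list(product(lambdas, alphas))
```

The grid contains 6 values of λ and 11 values of α, giving 66 combinations in total.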


Summary

Introduction

Singing voice analysis is important for active music listening interfaces [1] that enable a user to customize the contents of existing music recordings in ways not limited to frequency equalization and tempo adjustment. Since singing voices tend to form main melodies and strongly affect the moods of musical pieces, several methods have been proposed for editing the three major kinds of acoustic characteristics of singing voices: fundamental frequencies (F0s), timbres, and volumes. A system of speech analysis and synthesis called TANDEM-STRAIGHT [2], for example, decomposes human voices into F0s, spectral envelopes (timbres), and non-periodic components.

Manuscript received December 3, 2015; revised March 28, 2016 and May 25, 2016; accepted May 25, 2016.

