In digital media and games, sound effects are typically recorded or synthesized. While there are a great many digital synthesis tools, the quality of synthesized audio is generally not on par with that of sound recordings. Nonetheless, sound synthesis techniques provide a popular means of generating new sound variations. In this research, we study sound effects synthesis using generative models inspired by those used for high-quality speech and music synthesis. In particular, we explore the trade-off between synthesis quality and variation. With regard to quality, we integrate a reconstruction loss into the original training objective to penalize imperfect audio reconstruction, and we compare the results with neural vocoders and traditional spectrogram inversion methods. We use a Wasserstein GAN (WGAN) as an example model to explore the synthesis quality of generated sound effects, such as footsteps, birds, guns, rain, and engine sounds. In addition to synthesis quality, we also consider the range of sound variation that is possible with our generative model. We report on the trade-off that we obtain with our model regarding the quality and diversity of synthesized sound effects.
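The abstract does not spell out how the reconstruction loss is combined with the adversarial objective; the following is a minimal, hypothetical sketch of one common way to do it, assuming a PyTorch setup, a toy generator and critic (not the paper's architectures), an L1 reconstruction term, and an assumed weighting coefficient recon_weight:

```python
# Minimal sketch (not the paper's code): a WGAN generator update with an added
# reconstruction penalty, illustrating how imperfect reconstruction can be penalized.
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, audio_len = 64, 16384

# Hypothetical tiny generator and critic; the paper's architectures are not given here.
generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, audio_len), nn.Tanh()
)
critic = nn.Sequential(nn.Linear(audio_len, 256), nn.ReLU(), nn.Linear(256, 1))

recon_weight = 10.0  # assumed trade-off coefficient between adversarial and reconstruction terms
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)

real = torch.randn(8, audio_len)  # stand-in for real sound-effect waveforms
z = torch.randn(8, latent_dim)    # latent codes paired with the real clips
fake = generator(z)

# Standard WGAN generator loss plus an L1 reconstruction term that penalizes
# imperfect reconstruction of the paired real audio.
wgan_loss = -critic(fake).mean()
recon_loss = F.l1_loss(fake, real)
total_loss = wgan_loss + recon_weight * recon_loss

opt_g.zero_grad()
total_loss.backward()
opt_g.step()
```

A larger recon_weight would push the generator toward faithful reconstruction (quality), while a smaller one leaves more room for the adversarial term to drive variation, which mirrors the quality-versus-diversity trade-off the abstract describes.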