Abstract

In this paper, we propose a phase reconstruction framework, named Deep Griffin–Lim Iteration (DeGLI). Phase reconstruction is a fundamental technique for improving the quality of sound obtained through some process in the time-frequency domain. It has been shown that recent methods using deep neural networks (DNNs) outperform conventional iterative phase reconstruction methods such as the Griffin–Lim algorithm (GLA). However, the computational cost of DNN-based methods is not adjustable at the time of inference, which may limit the range of applications. To address this problem, we combine the iterative structure of GLA with a DNN so that the computational cost becomes adjustable by changing the number of iterations of the proposed DNN-based component. A training method that is independent of the number of iterations used for inference is also proposed to minimize the computational cost of training. This training method, named sub-block training by denoising (SBTD), avoids recursive use of the DNN and enables training of DeGLI with a single sub-block (corresponding to one GLA iteration). Furthermore, we propose a complex DNN based on complex convolution layers with gated mechanisms and investigate its performance within the proposed framework. Through several experiments, we found that DeGLI significantly improved both objective and subjective measures over GLA by incorporating the DNN, and its sound quality was comparable to that of neural vocoders.
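The iterative structure that DeGLI builds on can be sketched with plain GLA. The following is a minimal NumPy sketch, not the authors' implementation: the helper names (`stft`, `istft`, `griffin_lim`, `amplitude_error`), the Hann window, the hop size, and the synthetic test signal are all assumptions made for illustration, and the DNN that DeGLI places inside each iteration is not shown. It only demonstrates how the iteration count `n_iter` acts as the knob for computational cost.

```python
import numpy as np

def stft(x, win, hop):
    """Frame the signal, window each frame, and take the one-sided FFT."""
    n_fft = len(win)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(X, win, hop):
    """Least-squares inverse: windowed overlap-add divided by the summed squared window."""
    n_fft = len(win)
    length = n_fft + (X.shape[0] - 1) * hop
    frames = np.fft.irfft(X, n=n_fft, axis=1) * win
    x = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        x[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += win ** 2
    # Samples at the signal edges are covered by few frames and are poorly
    # conditioned; a practical implementation would pad the signal first.
    return x / np.maximum(norm, 1e-8)

def griffin_lim(amplitude, win, hop, n_iter, seed=0):
    """Plain GLA: alternate between consistency and the given amplitude."""
    rng = np.random.default_rng(seed)
    X = amplitude * np.exp(2j * np.pi * rng.random(amplitude.shape))  # random initial phase
    for _ in range(n_iter):
        Y = stft(istft(X, win, hop), win, hop)           # project onto consistent spectrograms
        X = amplitude * Y / np.maximum(np.abs(Y), 1e-8)  # restore the amplitude, keep the phase of Y
    return istft(X, win, hop)

def amplitude_error(x, amplitude, win, hop):
    """Relative mismatch between the STFT amplitude of x and the target amplitude."""
    return np.linalg.norm(np.abs(stft(x, win, hop)) - amplitude) / np.linalg.norm(amplitude)

# Hypothetical toy setup: a two-tone signal whose length matches the framing exactly.
win, hop = np.hanning(256), 64
t = np.arange(256 + 63 * hop)
target = np.sin(0.05 * t) + 0.5 * np.sin(0.21 * t)
amplitude = np.abs(stft(target, win, hop))

errors = {}
for n_iter in (0, 10, 100):
    x_hat = griffin_lim(amplitude, win, hop, n_iter)
    errors[n_iter] = amplitude_error(x_hat, amplitude, win, hop)
    print(n_iter, errors[n_iter])
```

More iterations cost proportionally more STFT/iSTFT pairs and should bring the amplitude mismatch down, which is the cost/quality trade-off that the abstract describes as adjustable at inference time.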

Highlights

  • Phase reconstruction of a spectrogram is an active research topic with various applications, such as speech synthesis [1]–[5], voice conversion [6], and sound source enhancement/separation [7]–[14]

  • We propose a phase reconstruction framework, named Deep Griffin–Lim Iteration (DeGLI), by incorporating a deep neural network (DNN) into the Griffin–Lim algorithm (GLA)

  • We have presented a DNN-based phase reconstruction framework, called DeGLI, which stacks a common GLA-based sub-block containing a DNN

Summary

Introduction

Phase reconstruction of a spectrogram is an active research topic with various applications, such as speech synthesis [1]–[5], voice conversion [6], and sound source enhancement/separation [7]–[14]. As a coefficient of the short-time Fourier transform (STFT) is a complex number, it consists of magnitude and phase. While both of them are necessary for reconstructing the corresponding time-domain signal using the […] Traditional sound source enhancement applies a real-valued time-frequency (T-F) mask, which modifies amplitude without affecting phase [15]. Another example is a recent speech synthesis approach that generates a time-domain signal by applying the inverse STFT (iSTFT) to the synthesized spectrogram after phase reconstruction [1]–[5]. Phase reconstruction is necessary for such amplitude-based acoustical technologies to obtain a time-domain signal with better sound quality [16]–[19].

Manuscript received May 1, 2020; revised August 25, 2020; accepted October 12, 2020.
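The statement that a real-valued T-F mask changes amplitude but not phase can be checked numerically. The sketch below uses random complex numbers as a stand-in for STFT coefficients and a strictly positive random mask; both are assumptions for illustration, not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a small block of complex STFT coefficients (frames x bins).
X = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))

# Each coefficient splits into amplitude and phase: X = amplitude * exp(j * phase).
amplitude, phase = np.abs(X), np.angle(X)
recombined = amplitude * np.exp(1j * phase)

# A positive real-valued mask, as used in amplitude-based enhancement.
mask = rng.uniform(0.1, 1.0, X.shape)
Y = mask * X

# The mask rescales the amplitude and leaves the phase as it was.
print(np.allclose(np.abs(Y), mask * amplitude))
print(np.allclose(np.angle(Y), phase))
```

Because the phase of `Y` is inherited unchanged from `X`, any mismatch between the masked amplitude and that phase remains in the output, which is the situation phase reconstruction is meant to address.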
