Abstract

Packet loss concealment (PLC) aims to mitigate the speech impairments caused by packet losses and thereby improve perceptual speech quality. This paper proposes an end-to-end PLC algorithm based on a time-frequency hybrid generative adversarial network, which incorporates a dilated residual convolution and an integrated pair of time-domain and frequency-domain discriminators into a convolutional encoder-decoder architecture. The dilated residual convolution aggregates the short-term and long-term context information of lost speech frames through two network receptive fields with different dilation rates. The integrated time-frequency discriminators learn multi-resolution time-frequency features from correctly received speech frames, using both the time-domain waveform and the frequency-domain complex spectra. Both causal and noncausal strategies are proposed for the packet-loss problem; they can effectively reduce the transitional distortion caused by lost speech frames with a significantly reduced number of training parameters and lower computational complexity. The experimental results show that the proposed method achieves better performance on three objective measurements: the signal-to-noise ratio, perceptual evaluation of speech quality, and short-time objective intelligibility. The results of a subjective listening test further confirm an improvement in perceptual speech quality.
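The idea of combining two receptive fields with different dilation rates can be illustrated with a small NumPy sketch. This is not the paper's network: the function names, the kernel sizes, the dilation rates (1 and 4), and the `tanh` nonlinearity are all assumptions chosen for illustration, and the filters are fixed rather than learned.

```python
import numpy as np

def causal_dilated_conv1d(x, kernel, dilation):
    """Compute y[n] = sum_k kernel[k] * x[n - k * dilation], treating x as zero before n = 0."""
    pad = (len(kernel) - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for k, weight in enumerate(kernel):
        start = pad - k * dilation  # shift the input back by k * dilation samples
        y += weight * padded[start:start + len(x)]
    return y

def receptive_field(kernel_size, dilation):
    """Number of input samples spanned by one dilated convolution."""
    return (kernel_size - 1) * dilation + 1

def dilated_residual_block(x, k_short, k_long, d_short=1, d_long=4):
    """Hypothetical block: add a short-context and a long-context branch, then a skip connection."""
    short_branch = causal_dilated_conv1d(x, k_short, d_short)
    long_branch = causal_dilated_conv1d(x, k_long, d_long)
    return x + np.tanh(short_branch + long_branch)

# An impulse shows which past samples each branch can see.
impulse = np.zeros(10)
impulse[0] = 1.0
taps = np.ones(3)
long_response = causal_dilated_conv1d(impulse, taps, dilation=4)
block_output = dilated_residual_block(impulse, taps, taps)
```

With a kernel of three taps, the branch with dilation 1 spans 3 samples and the branch with dilation 4 spans 9, so the second branch reaches further into the past at the same parameter count.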

Highlights

  • Remote speech transmission is an important technology that is widely used for long-distance audio-video communication

  • This paper focuses on reconstructing the continuous speech of the jth speech frame when it has been lost during transmission

  • It is noted that the reconstruction performance of the magnitude-only discriminator, referred to as WavMag integrated discriminators in Table II, is similar to that of the convolutional encoder-decoder (CED) based on the multi-resolution short-time Fourier transform (STFT) loss, whereas the proposed complex-spectrum discriminator, referred to as Wav-DC integrated discriminators in Table II, exhibits more significant improvements
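One way to see why a complex spectrum can carry more information than a magnitude spectrum is to compare a frame with its polarity-inverted copy. The sketch below is only an illustration under stated assumptions: it uses plain hand-written distances rather than learned discriminators, and the function names, the 256-sample frame, and the Hann window are hypothetical choices, not taken from the paper.

```python
import numpy as np

def frame_spectrum(frame):
    """Complex spectrum of one Hann-windowed frame."""
    return np.fft.rfft(frame * np.hanning(len(frame)))

def magnitude_distance(a, b):
    """Mean absolute difference between the magnitude spectra (phase is discarded)."""
    return float(np.mean(np.abs(np.abs(frame_spectrum(a)) - np.abs(frame_spectrum(b)))))

def complex_distance(a, b):
    """Mean absolute difference between the complex spectra (magnitude and phase)."""
    return float(np.mean(np.abs(frame_spectrum(a) - frame_spectrum(b))))

n = np.arange(256)
reference = np.sin(2 * np.pi * 8 * n / 256)  # hypothetical clean frame
inverted = -reference                        # same magnitudes, opposite phase

mag_gap = magnitude_distance(reference, inverted)
complex_gap = complex_distance(reference, inverted)
```

Here `mag_gap` is numerically zero while `complex_gap` is clearly positive, so a criterion that sees only magnitudes cannot tell the two frames apart even though their waveforms differ.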


Summary

Introduction

Remote speech transmission is an important technology that is widely used for long-distance audio-video communication. Its greatest challenges include transmission latency and speech quality. Unlike traditional circuit-switched transmission networks, packet-based remote transmission may lose some speech content because of packet delay, loss, or jitter (Takahashi et al., 2004). These problems can degrade the quality of the decoded speech and the users' communication experience. To mitigate packet loss at the receiver side, a jitter buffer can be applied to connect the received speech frames at the expense of some additional latency (Liang et al., 2003). It is not a trivial task to make

a) at: Linköping University–Guangzhou University Research Center on Urban Sustainable Development, Guangzhou University, Guangzhou, 510006, China
