Abstract

Mandarin text-to-speech (TTS) systems depend heavily on front-end processing, such as grapheme-to-phoneme conversion and prosodic boundary prediction, to produce expressive, human-like speech. Utilizing a pre-trained language model, such as the bidirectional encoder representations from Transformers (BERT), can significantly improve the TTS front-end's performance. However, the original BERT is too large for edge TTS applications with tight constraints on memory footprint and inference latency. Although a distilled BERT can alleviate this problem, a considerable efficiency barrier may still remain due to the self-attention module's quadratic complexity and the feed-forward module's large computational cost. To this end, we propose a lightweight distilled convolution network as an alternative to the distilled BERT. Unlike previous knowledge distillation methods, which commonly use the same self-attention architecture for the teacher and student models, we transfer knowledge from the self-attention network to a convolution network. Experiments on two major Mandarin TTS front-end tasks show that our distilled convolution model achieves results comparable to various distilled BERT variants while drastically reducing model size and inference latency.
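The sketch below illustrates the general idea of cross-architecture knowledge distillation described in the abstract: a small 1-D convolutional student is trained to match the soft token-level predictions of a BERT-style teacher on a sequence-labeling front-end task such as prosodic boundary prediction. It is a minimal illustration, not the paper's implementation; all module sizes, layer counts, the temperature, and the loss weighting are assumptions chosen for readability.

```python
# Minimal sketch of distilling a self-attention teacher into a convolution
# student for token-level classification. Hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvStudent(nn.Module):
    """Lightweight convolutional encoder with a per-token classifier."""

    def __init__(self, vocab_size, hidden=256, num_labels=5, kernel=5, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.convs = nn.ModuleList(
            nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)
            for _ in range(layers)
        )
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, token_ids):                     # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)     # (batch, hidden, seq_len)
        for conv in self.convs:
            x = F.relu(conv(x)) + x                   # residual conv blocks
        x = x.transpose(1, 2)                         # (batch, seq_len, hidden)
        return self.classifier(x)                     # (batch, seq_len, num_labels)


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-label KL term (teacher -> student) plus hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * soft + (1 - alpha) * hard


# Hypothetical training step: the (frozen) teacher provides soft targets.
if __name__ == "__main__":
    student = ConvStudent(vocab_size=6000)
    token_ids = torch.randint(0, 6000, (2, 32))
    labels = torch.randint(0, 5, (2, 32))
    with torch.no_grad():
        teacher_logits = torch.randn(2, 32, 5)        # stand-in for BERT teacher output
    loss = distillation_loss(student(token_ids), teacher_logits, labels)
    loss.backward()
```

Because the student uses fixed-width convolutions instead of self-attention, its cost grows linearly with sequence length, which is the efficiency argument the abstract makes for edge deployment.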
