Abstract

Mandarin text-to-speech (TTS) systems depend heavily on front-end processing, such as grapheme-to-phoneme conversion and prosodic boundary prediction, to produce expressive, human-like speech. Utilizing a pre-trained language model, such as the bidirectional encoder representations from Transformers (BERT), can significantly improve the TTS front-end's performance. However, the original BERT is too large for edge TTS applications with tight constraints on memory footprint and inference latency. Although a distilled BERT can alleviate this problem, a considerable efficiency barrier may still remain due to the self-attention module's quadratic complexity and the feed-forward module's large computational cost. To this end, we propose a lightweight distilled convolution network as an alternative to the distilled BERT. Unlike previous knowledge distillation methods, which commonly use the same self-attention architecture for the teacher and student models, we transfer knowledge from the self-attention network to a convolution network. Experiments on two major Mandarin TTS front-end tasks show that our distilled convolution model achieves results comparable to various distilled BERT variants while drastically reducing model size and inference latency.
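The sketch below illustrates the general idea of cross-architecture knowledge distillation described in the abstract: a small 1-D convolutional student is trained to match the soft token-level predictions of a BERT-style teacher on a sequence-labeling front-end task such as prosodic boundary prediction. It is a minimal illustration, not the paper's implementation; all module sizes, layer counts, the temperature, and the loss weighting are assumptions chosen for readability.

```python
# Minimal sketch of distilling a self-attention teacher into a convolution
# student for token-level classification. Hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvStudent(nn.Module):
    """Lightweight convolutional encoder with a per-token classifier."""

    def __init__(self, vocab_size, hidden=256, num_labels=5, kernel=5, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.convs = nn.ModuleList(
            nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)
            for _ in range(layers)
        )
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, token_ids):                     # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)     # (batch, hidden, seq_len)
        for conv in self.convs:
            x = F.relu(conv(x)) + x                   # residual conv blocks
        x = x.transpose(1, 2)                         # (batch, seq_len, hidden)
        return self.classifier(x)                     # (batch, seq_len, num_labels)


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-label KL term (teacher -> student) plus hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * soft + (1 - alpha) * hard


# Hypothetical training step: the (frozen) teacher provides soft targets.
if __name__ == "__main__":
    student = ConvStudent(vocab_size=6000)
    token_ids = torch.randint(0, 6000, (2, 32))
    labels = torch.randint(0, 5, (2, 32))
    with torch.no_grad():
        teacher_logits = torch.randn(2, 32, 5)        # stand-in for BERT teacher output
    loss = distillation_loss(student(token_ids), teacher_logits, labels)
    loss.backward()
```

Because the student uses fixed-width convolutions instead of self-attention, its cost grows linearly with sequence length, which is the efficiency argument the abstract makes for edge deployment.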
