In data-driven, corpus-based text-to-speech synthesis systems, the main challenge is to select the most natural-sounding sequence of acoustic units, avoiding unnatural acoustic transitions and minimizing acoustic mismatches at the concatenation points. Unit selection algorithms that incorporate unit selection cost functions are known to synthesize speech close to natural quality. However, these algorithms operate over large acoustic inventories containing a huge number of acoustic units in a broad spectrum of linguistic, prosodic and acoustic contexts, and therefore face a huge number of concatenation possibilities. Moreover, the shape of the unit selection cost function, which evaluates the cost of concatenating two subsequent acoustic units, is modelled manually in a time-consuming and laborious iterative process based on subjective evaluation. Since this process must be repeated for every new acoustic inventory, and even after changes to an existing one, we propose instead a new fuzzy unit selection cost function. We further propose to optimize the shape of the fuzzy unit selection cost function fully automatically for the given acoustic inventory’s context by using a relaxed gradient descent algorithm, in which the subjective tests are replaced by a novel objective measure for evaluating unit selection cost function performance. Furthermore, the proposed approach is fully interpretable and provides insight into which parts of the fuzzy unit selection cost function’s shape could be further improved. The experiments show that the optimized fuzzy unit selection cost function significantly outperforms the baseline fuzzy unit selection cost function. Moreover, the results show that the unit selection optimization algorithm is capable of finding the optimal shape of the fuzzy unit selection cost function, even when optimized over a small subset of sentences.
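To make the role of a concatenation cost concrete, the following is a minimal illustrative sketch, not the cost function or optimizer proposed here. It assumes a hypothetical unit representation (boundary pitch and energy), a hypothetical triangular fuzzy membership with hand-picked widths, and a standard Viterbi search that picks one candidate unit per position so that the summed concatenation cost is minimal. The names `small_mismatch`, `fuzzy_concat_cost` and `select_units`, and all numeric widths, are assumptions introduced for illustration only.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical unit description: (pitch in Hz, energy in dB) at the unit boundary.
Unit = Tuple[float, float]


def small_mismatch(diff: float, width: float) -> float:
    """Triangular fuzzy membership: 1.0 for no mismatch, 0.0 at or beyond `width`."""
    return max(0.0, 1.0 - abs(diff) / width)


def fuzzy_concat_cost(left: Unit, right: Unit,
                      pitch_width: float = 40.0,
                      energy_width: float = 12.0) -> float:
    """Cost in [0, 1]: one minus the fuzzy AND (minimum) of both memberships.

    The widths are assumed shape parameters; in an automatic setup they are
    the kind of values an optimizer would tune instead of a human listener.
    """
    mu_pitch = small_mismatch(left[0] - right[0], pitch_width)
    mu_energy = small_mismatch(left[1] - right[1], energy_width)
    return 1.0 - min(mu_pitch, mu_energy)


def select_units(candidates: Sequence[Sequence[Unit]],
                 cost: Callable[[Unit, Unit], float] = fuzzy_concat_cost
                 ) -> Tuple[List[int], float]:
    """Viterbi search: choose one candidate per position with minimal total cost."""
    best = [0.0] * len(candidates[0])   # best cumulative cost per candidate
    back: List[List[int]] = []          # back-pointers for each later position
    for pos in range(1, len(candidates)):
        new_best: List[float] = []
        pointers: List[int] = []
        for unit in candidates[pos]:
            totals = [best[k] + cost(prev, unit)
                      for k, prev in enumerate(candidates[pos - 1])]
            k_min = min(range(len(totals)), key=totals.__getitem__)
            new_best.append(totals[k_min])
            pointers.append(k_min)
        best = new_best
        back.append(pointers)
    # Trace the cheapest path backwards from the last position.
    idx = min(range(len(best)), key=best.__getitem__)
    total = best[idx]
    path = [idx]
    for pointers in reversed(back):
        idx = pointers[idx]
        path.append(idx)
    path.reverse()
    return path, total


if __name__ == "__main__":
    inventory = [
        [(100.0, 60.0), (140.0, 60.0)],
        [(102.0, 61.0), (180.0, 70.0)],
        [(104.0, 60.0)],
    ]
    print(select_units(inventory))
```

In this sketch the membership widths play the part of the cost function's shape. A gradient-based optimizer, which is not implemented above, would adjust such parameters against an objective quality measure rather than through listening tests.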