A major drawback of corpus-based speech synthesis systems is their reliance on large acoustic inventories, and one of the main current challenges is the optimal representation of the concatenation costs associated with units in the acoustic inventory. These concatenation costs are used to evaluate spectral mismatches between the acoustic units to be concatenated. The number of costs grows combinatorially with the size of the acoustic inventory and can result in hundreds of millions or even billions of concatenation costs to be processed. Therefore, in this paper, we present a novel unit selection optimization algorithm, which reduces the storage required for concatenation costs through a vector quantization-based compression technique and tuple structures. Furthermore, the proposed optimization algorithm is designed to serve as an objective measure for optimizing the performance of the unit selection cost function with respect to the quality of the speech output, and for evaluating the effect of the vector quantization-based compression technique on that performance. The results obtained show that even when data compression exceeds 50%, the effect on the quality of the synthesized speech is negligible.
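To illustrate the general idea of quantization-based compression of concatenation costs, the following is a minimal sketch and not the algorithm proposed in this paper. It assumes the costs are available as a dense matrix of 32-bit floats and replaces each cost with an 8-bit index into a small codebook trained with Lloyd's algorithm (one-dimensional vector quantization). The function names `train_codebook` and `quantize`, the codebook size of 256, and the synthetic gamma-distributed costs are all assumptions made for illustration; the tuple structures mentioned above are not modelled.

```python
import numpy as np


def quantize(costs, codebook):
    """Map each cost to the index of its nearest codeword.

    Assumes `codebook` is sorted in ascending order and has at most 256 entries.
    """
    costs = np.asarray(costs, dtype=np.float64)
    # Decision boundaries are the midpoints between adjacent codewords.
    edges = (codebook[:-1] + codebook[1:]) / 2.0
    return np.searchsorted(edges, costs).astype(np.uint8)


def train_codebook(costs, n_codes=256, n_iter=20):
    """Train a scalar codebook with Lloyd's algorithm (1-D k-means)."""
    costs = np.asarray(costs, dtype=np.float64).ravel()
    # Initialise codewords at evenly spaced quantiles of the data.
    codebook = np.quantile(costs, np.linspace(0.0, 1.0, n_codes))
    for _ in range(n_iter):
        idx = quantize(costs, codebook).astype(np.intp)
        sums = np.bincount(idx, weights=costs, minlength=n_codes)
        counts = np.bincount(idx, minlength=n_codes)
        nonempty = counts > 0
        # Move each non-empty codeword to the mean of its assigned costs.
        codebook[nonempty] = sums[nonempty] / counts[nonempty]
        codebook.sort()
    return codebook


# Hypothetical data: a 200 x 200 matrix of synthetic concatenation costs.
rng = np.random.default_rng(0)
costs = rng.gamma(shape=2.0, scale=1.0, size=(200, 200)).astype(np.float32)

codebook = train_codebook(costs, n_codes=256)
codes = quantize(costs, codebook)   # one byte per cost instead of four
approx = codebook[codes]            # decompressed (approximate) costs

compressed_bytes = codes.nbytes + codebook.astype(np.float32).nbytes
saving = 1.0 - compressed_bytes / costs.nbytes
mean_abs_error = float(np.mean(np.abs(approx - costs)))
```

In this sketch the storage saving comes entirely from replacing four-byte floats with one-byte indices plus a small shared codebook, while `mean_abs_error` gives a rough indication of how much the decompressed costs deviate from the originals. Whether such a deviation is perceptually negligible would still have to be checked against the synthesized speech, as the paper does.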