Large acoustic inventories are required to produce speech of near-natural quality. However, the concatenation cost space grows exponentially with the number of acoustic units in the inventory, increasing the latency of the unit selection algorithm and rendering it unusable in real-time end-to-end systems. Even when data compression techniques are introduced, the model size remains large, posing a challenge for end-to-end systems. Thus, in this paper, we propose representing the concatenation cost space with a Long Short-Term Memory (LSTM) network. The results show a 90% reduction in the size of the data space compared to all our previous techniques and a decrease of over 70% in look-up time. The proposed LSTM-based compression significantly increases the responsiveness of corpus-based text-to-speech systems while keeping the overall speech quality at the same level.
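To make the idea concrete, the following is a minimal sketch, not the authors' architecture, of how an LSTM could stand in for a stored concatenation cost table by predicting join costs from unit features. The class name `ConcatCostLSTM`, the feature dimension, the hidden size, and the use of PyTorch are all assumptions made for illustration only.

```python
# Illustrative sketch only (assumed architecture, not the paper's exact model):
# an LSTM that maps a sequence of candidate-unit acoustic feature vectors to
# predicted concatenation (join) costs, replacing an explicit cost look-up table.
import torch
import torch.nn as nn

class ConcatCostLSTM(nn.Module):
    def __init__(self, feat_dim=20, hidden_dim=64):
        super().__init__()
        # Encode the sequence of candidate-unit feature vectors.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Map each hidden state to a predicted cost of joining with the next unit.
        self.cost_head = nn.Linear(hidden_dim, 1)

    def forward(self, unit_feats):
        # unit_feats: (batch, seq_len, feat_dim) acoustic features of candidate units
        hidden, _ = self.lstm(unit_feats)
        # One predicted concatenation cost per adjacent unit pair (drop last step).
        return self.cost_head(hidden[:, :-1, :]).squeeze(-1)

# Usage example with hypothetical dimensions: 10 candidate units, 20-dim features.
model = ConcatCostLSTM()
feats = torch.randn(1, 10, 20)
predicted_costs = model(feats)   # shape: (1, 9), one cost per adjacent pair
print(predicted_costs.shape)
```

Under this assumed setup, only the network weights need to be stored rather than the full pairwise cost space, which is the kind of trade-off the abstract's reported size and look-up time reductions refer to.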