In this paper, we compare different attention mechanisms on the task of unsupervised audio generation, following previous work in language modeling. This is an important problem, since speech synthesis technology converts textual information into acoustic waveform signals. Such systems can be conveniently integrated into mobile devices and used in applications such as voice messengers or email apps. Reading important messages can be difficult while traveling abroad, where suitable computer systems may be unavailable or security problems may arise. With this technology, e-mail messages can be listened to quickly and efficiently on smartphones, boosting productivity. It is also used to assist visually impaired people: for instance, screen content can be read aloud automatically to a blind user. Nowadays, home appliances such as slow cookers can use it to read culinary recipes, automobiles can use it for voice navigation, and language learners can use it for pronunciation training. Speech generation is the inverse problem of automatic speech recognition (ASR) and has been researched since the second half of the eighteenth century. The technology also helps vocally handicapped people communicate with others who do not understand sign language. However, audio generation poses two challenges. First, the audio sampling rate is very high, which leads to very long sequences that are computationally difficult to model. Second, speech with the same semantic meaning can be realized by many signals with significant variability, caused by the channel environment, pronunciation, or speaker timbre. To overcome these problems, we train an autoencoder model to discretize the continuous audio signal into a finite set of discriminative audio tokens with a lower sampling rate. 
Subsequently, autoregressive models that are not conditioned on text are trained on this representation space to predict the next token from the preceding sequence elements. This modeling approach therefore resembles causal language modeling. In our study, we show that, unlike in the original MEGA work, traditional attention outperforms moving average equipped gated attention, indicating that EMA gated attention is not yet stable and requires careful hyper-parameter optimization.
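
For readers unfamiliar with the moving-average component, the following is a minimal sketch of a damped exponential moving average of the kind that MEGA-style gated attention builds on. It is an illustration only and not the implementation used in this study: the function name `damped_ema`, the scalar coefficients `alpha` and `delta`, and the zero initial state are assumptions, and the sketch operates on a single scalar channel rather than on learned multi-dimensional parameters.

```python
def damped_ema(xs, alpha, delta):
    """Damped EMA over a sequence: y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}.

    Hypothetical helper for illustration. `alpha` weights the current input,
    `delta` damps the contribution of the previous state, and the state
    before the first element is assumed to be 0.
    """
    if not (0.0 < alpha < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("alpha and delta must lie in (0, 1)")
    ys = []
    prev = 0.0  # assumed initial state
    for x in xs:
        # Each output depends only on the current and earlier inputs,
        # so the recurrence is causal, as required for next-token prediction.
        prev = alpha * x + (1.0 - alpha * delta) * prev
        ys.append(prev)
    return ys


# Response to a unit impulse: the first input decays over later positions.
smoothed = damped_ema([1.0, 0.0, 0.0], alpha=0.5, delta=0.5)
print(smoothed)
```

Because the recurrence only looks backward, truncating the input does not change the earlier outputs, which is the same causality constraint that the autoregressive models described above must satisfy.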