Abstract

In text, word2vec transforms each word into a fixed-size vector used as the basic component in applications of natural language processing. Given a large collection of unannotated audio, audio word2vec can also be trained in an unsupervised way using a sequence-to-sequence autoencoder (SA). These vector representations are shown to effectively describe the sequential phonetic structures of the audio segments. In this paper, we further extend this research in the following two directions. First, we disentangle phonetic information and speaker information from the SA vector representations. Second, we extend audio word2vec from the word level to the utterance level by proposing a new segmental audio word2vec in which unsupervised spoken word boundary segmentation and audio word2vec are jointly learned and mutually enhanced, and utterances are directly represented as sequences of vectors carrying phonetic information. This is achieved by means of a segmental sequence-to-sequence autoencoder, in which a segmentation gate trained with reinforcement learning is inserted in the encoder.

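To make the described architecture concrete, the following is a minimal illustrative sketch (not the authors' code) of a segmental sequence-to-sequence autoencoder: a GRU encoder reads acoustic frames, a segmentation gate samples a binary boundary decision per frame (so the gate can be trained with a REINFORCE-style reinforcement learning signal), and each detected segment is summarized by the encoder state and reconstructed by a GRU decoder. All layer sizes, names, and training details here are assumptions made for illustration only.

# Minimal sketch of a segmental sequence-to-sequence autoencoder.
# All dimensions and module names are illustrative assumptions.
import torch
import torch.nn as nn

class SegmentalSeq2SeqAutoencoder(nn.Module):
    def __init__(self, feat_dim=39, hidden_dim=128):
        super().__init__()
        self.encoder = nn.GRUCell(feat_dim, hidden_dim)
        self.gate = nn.Linear(hidden_dim, 1)          # segmentation gate
        self.decoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.project = nn.Linear(hidden_dim, feat_dim)

    def forward(self, frames):
        # frames: (T, feat_dim) acoustic features of one utterance.
        h = frames.new_zeros(1, self.encoder.hidden_size)
        embeddings, boundaries, log_probs, start = [], [], [], 0
        for t, x in enumerate(frames):
            h = self.encoder(x.unsqueeze(0), h)
            p = torch.sigmoid(self.gate(h)).squeeze()  # boundary probability
            b = torch.bernoulli(p)                     # sampled decision (RL)
            log_probs.append(b * torch.log(p + 1e-8)
                             + (1 - b) * torch.log(1 - p + 1e-8))
            if b.item() == 1 or t == len(frames) - 1:
                embeddings.append(h.squeeze(0))        # audio word2vec vector
                boundaries.append((start, t + 1))
                h = torch.zeros_like(h)                # reset for next segment
                start = t + 1
        # Reconstruct each segment from its embedding (teacher-forced inputs).
        recon_loss = 0.0
        for emb, (s, e) in zip(embeddings, boundaries):
            seg = frames[s:e].unsqueeze(0)             # (1, seg_len, feat_dim)
            out, _ = self.decoder(seg, emb.unsqueeze(0).unsqueeze(0))
            recon_loss = recon_loss + ((self.project(out) - seg) ** 2).mean()
        # recon_loss trains encoder/decoder by backpropagation; the negative
        # reconstruction error can serve as the reward for the sampled
        # boundaries via REINFORCE on the summed log-probabilities.
        return embeddings, recon_loss, torch.stack(log_probs).sum()

# Usage: encode a 200-frame utterance of 39-dim features into segment vectors.
model = SegmentalSeq2SeqAutoencoder()
vectors, loss, boundary_log_prob = model(torch.randn(200, 39))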