Method and apparatus for converting text into audible signals using a neural network

Orhan Karaali

doi:10.1121/1.422728

Abstract

Text may be converted to audible signals, such as speech, by first training a neural network on recorded audio messages (204). To begin the training, the recorded audio messages are converted into a series of audio frames (205) having a fixed duration (213). Each audio frame is then assigned a phonetic representation (203) and a target acoustic representation. The phonetic representation (203) is a binary word that represents the phone and articulation characteristics of the audio frame, while the target acoustic representation is a vector of audio information such as pitch and energy. After training, the neural network is used to convert text into speech. First, the text to be converted is translated into a series of phonetic frames that have the same form as the phonetic representations (203) and the same fixed duration (213). The neural network then produces acoustic representations in response to context descriptions (207) that include some of the phonetic frames. Finally, a synthesizer converts the acoustic representations into a speech waveform.
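The data preparation described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The 5 ms frame duration, the toy phone inventory `PHONES`, the one-hot encoding, the zero padding at the edges, and the function names `split_into_frames`, `encode_phone`, and `context_description` are all assumptions made for illustration. The abstract states only that frames have a fixed duration, that the phonetic representation is a binary word, and that a context description includes some of the phonetic frames. Articulation characteristics, the target acoustic vectors, the neural network, and the synthesizer are omitted.

```python
from typing import List, Sequence

FRAME_MS = 5                       # assumed value; the abstract only says "fixed duration"
PHONES = ["sil", "k", "ae", "t"]   # hypothetical toy phone inventory


def split_into_frames(samples: Sequence[float], sample_rate: int,
                      frame_ms: int = FRAME_MS) -> List[List[float]]:
    """Cut samples into consecutive equal-length frames, dropping a trailing partial frame."""
    frame_len = sample_rate * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame duration is shorter than one sample")
    n_frames = len(samples) // frame_len
    return [list(samples[i * frame_len:(i + 1) * frame_len]) for i in range(n_frames)]


def encode_phone(phone: str) -> List[int]:
    """Return a one-hot binary word for the phone (articulation bits are left out here)."""
    if phone not in PHONES:
        raise ValueError(f"unknown phone: {phone}")
    return [1 if p == phone else 0 for p in PHONES]


def context_description(phonetic_frames: Sequence[Sequence[int]], index: int,
                        radius: int = 1) -> List[int]:
    """Concatenate the frame at `index` with `radius` neighbours on each side.

    Positions that fall outside the sequence are filled with zeros (an assumption).
    """
    width = len(phonetic_frames[0])
    description: List[int] = []
    for j in range(index - radius, index + radius + 1):
        if 0 <= j < len(phonetic_frames):
            description.extend(phonetic_frames[j])
        else:
            description.extend([0] * width)
    return description
```

In this sketch, each phonetic frame would be paired with the acoustic vector of the audio frame at the same position during training, and at synthesis time `context_description` would supply the network input for each frame in turn.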
