Abstract
A major difficulty in articulatory analysis/synthesis is the estimation of vocal-tract parameters from input speech. The use of neural networks to extract these parameters is more attractive than codebook look-up due to its lower computational complexity. For example, a multilayer perceptron (MLP) with two hidden layers, trained and evaluated on a small data set, was shown to perform a reasonable mapping of acoustic-to-geometric parameters. Increasing the training data, however, revealed ambiguity in the mapping that could not be resolved by a single network. This paper addresses the problem using an assembly of MLPs, each dedicated to a specific region of the articulatory space. Training data were generated by randomly sampling the parameters of an articulatory model of the vocal system. The resultant vocal-tract shapes were clustered into 128 regions, and an MLP with one hidden layer was assigned to each region for mapping 18 cepstral coefficients into ten tract areas and a nasalization parameter. Networks were selected by dynamic programming and were used to control a time-domain articulatory synthesizer. After training, significant perceptual and objective improvements were achieved relative to using a single MLP. Performance comparable to codebook look-up with dynamic programming was obtained. This model, however, requires only 4% of the storage needed for the codebook and performs the mapping faster by a factor of 20.
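The following is a minimal sketch of the assembly-of-MLPs idea described above: an ensemble of 128 one-hidden-layer networks, each covering one articulatory region, maps 18 cepstral coefficients to ten vocal-tract areas plus a nasalization parameter, and a Viterbi-style dynamic program selects one network per frame. The hidden-layer size, cost functions, and all identifiers are illustrative assumptions, not the authors' exact configuration, and the networks here are untrained.

```python
# Hedged sketch of the paper's MLP assembly with DP network selection.
# Assumptions: hidden width 32, toy cost functions, random (untrained) weights.
import numpy as np

N_NETS, N_CEPSTRA, N_HIDDEN, N_OUT = 128, 18, 32, 11  # 10 tract areas + nasalization
rng = np.random.default_rng(0)

class RegionMLP:
    """One-hidden-layer MLP dedicated to a single articulatory region."""
    def __init__(self):
        self.w1 = rng.standard_normal((N_CEPSTRA, N_HIDDEN)) * 0.1
        self.b1 = np.zeros(N_HIDDEN)
        self.w2 = rng.standard_normal((N_HIDDEN, N_OUT)) * 0.1
        self.b2 = np.zeros(N_OUT)

    def forward(self, cepstra):
        h = np.tanh(cepstra @ self.w1 + self.b1)
        return h @ self.w2 + self.b2  # ten areas + nasalization parameter

nets = [RegionMLP() for _ in range(N_NETS)]

def select_networks(cepstra_frames, local_cost, trans_cost):
    """Viterbi-style DP: choose one network per frame, minimizing a
    frame-local cost plus a transition cost that favors smooth tract shapes."""
    T = len(cepstra_frames)
    # Candidate articulatory vectors from every network at every frame.
    cand = np.array([[net.forward(c) for net in nets] for c in cepstra_frames])
    cost = np.full((T, N_NETS), np.inf)
    back = np.zeros((T, N_NETS), dtype=int)
    cost[0] = [local_cost(cepstra_frames[0], cand[0, k]) for k in range(N_NETS)]
    for t in range(1, T):
        for k in range(N_NETS):
            step = cost[t - 1] + np.array(
                [trans_cost(cand[t - 1, j], cand[t, k]) for j in range(N_NETS)])
            back[t, k] = np.argmin(step)
            cost[t, k] = step[back[t, k]] + local_cost(cepstra_frames[t], cand[t, k])
    # Trace back the lowest-cost network sequence.
    path = [int(np.argmin(cost[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    path.reverse()
    return [cand[t, k] for t, k in enumerate(path)]

# Toy usage (placeholder costs, not the paper's): penalize abrupt area changes.
frames = [rng.standard_normal(N_CEPSTRA) for _ in range(5)]
local = lambda c, y: 0.0                          # stand-in acoustic-match cost
smooth = lambda a, b: float(np.sum((a - b) ** 2)) # smoothness transition cost
trajectory = select_networks(frames, local, smooth)
```

In practice the local cost would compare the input cepstra against cepstra re-synthesized from each candidate tract shape; the zero placeholder above keeps the sketch self-contained.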