Abstract

Speech signals are produced by the smooth and continuous movements of the human articulators. An articulatory representation of speech is considered a more compact, more universal, and language-independent feature space and can therefore improve crosslingual and multilingual speech recognition systems, especially when porting components from one language to another in low-resource scenarios. However, learning the acoustic-to-articulatory conversion has proven to be a very challenging task. In this paper, we utilize a manifold learning technique to derive a nonlinear feature transformation from the conventional filterbank feature space to an articulatory-like feature space. The coordinates of the resulting representation, some of which have demonstrable phonological meaning, are shown to be highly portable across languages. We propose a framework for data selection and graph construction to train coordinates from multilingual data, which allows the coordinate space to be trained when abundant out-of-language data is available. Deep neural network (DNN) bottleneck features are demonstrated to exhibit a greater degree of language independence when using this representation than when using filterbank features as inputs. The usability of this representation is further demonstrated in a number of DNN-based speech recognition experiments in a variety of crosslingual and multilingual scenarios on the multilingual GlobalPhone dataset. In particular, speech recognition systems developed in low-resource settings benefit from the improved portability across languages.
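To make the graph-based manifold learning idea concrete, the following is a minimal, illustrative sketch of one standard technique of this family, Laplacian Eigenmaps: a neighborhood graph is built over feature frames, edges are weighted with a heat kernel, and the low-dimensional coordinates are obtained from the smallest nontrivial eigenvectors of the graph Laplacian. This is a generic sketch, not the paper's actual pipeline; the function name, parameters, and the use of random data in place of filterbank features are all assumptions for illustration.

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(X, n_components=3, n_neighbors=6, sigma=1.0):
    """Embed rows of X into n_components manifold coordinates.

    Illustrative Laplacian Eigenmaps sketch; in the paper's setting,
    rows of X would be filterbank feature frames.
    """
    # Pairwise squared Euclidean distances between frames.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

    # k-nearest-neighbor graph with heat-kernel edge weights.
    idx = np.argsort(d2, axis=1)[:, 1:n_neighbors + 1]  # skip self (col 0)
    n = X.shape[0]
    W = np.zeros((n, n))
    rows = np.repeat(np.arange(n), n_neighbors)
    W[rows, idx.ravel()] = np.exp(-d2[rows, idx.ravel()] / (2 * sigma ** 2))
    W = np.maximum(W, W.T)  # symmetrize the adjacency

    # Graph Laplacian L = D - W, with D the degree matrix.
    D = np.diag(W.sum(axis=1))
    L = D - W

    # Solve the generalized eigenproblem L v = lambda D v and keep the
    # eigenvectors for the smallest nonzero eigenvalues (the constant
    # eigenvector at lambda = 0 is discarded).
    vals, vecs = eigh(L, D)
    return vecs[:, 1:n_components + 1]
```

In the crosslingual setting described above, the graph would be constructed from pooled multilingual frames so that the learned coordinates are shared across languages; new frames are then mapped into this space and fed to the DNN in place of (or alongside) filterbank features.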
