Abstract
Phone tokenisers are used in spoken language recognition (SLR) to obtain elementary phonetic information. We present a study on the use of deep neural network tokenisers. Unsupervised cross-lingual adaptation was performed to adapt the baseline tokeniser, trained on English conversational telephone speech data, to different languages. Two training and adaptation approaches, namely cross-entropy adaptation and state-level minimum Bayes risk adaptation, were tested in a bottleneck i-vector and a phonotactic SLR system. The SLR systems using the tokenisers adapted to different languages were combined using score fusion, giving a 7–18% reduction in minimum detection cost function (minDCF) compared with the baseline configurations without adapted tokenisers. Analysis of the results showed that the ensemble of tokenisers gave diverse representations of phonemes, bringing complementary effects when SLR systems with different tokenisers were combined. SLR performance was also shown to be related to the quality of the adapted tokenisers.
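As a point of reference for the minDCF figures above, the following is a minimal sketch of how a minimum detection cost can be computed from target and non-target scores; the function name, the default priors and costs (p_target, c_miss, c_fa), and the threshold sweep are illustrative assumptions rather than the exact evaluation setup used in the paper.

```python
import numpy as np

def min_dcf(target_scores, nontarget_scores, p_target=0.5, c_miss=1.0, c_fa=1.0):
    """Minimum of DCF(t) = c_miss * P_miss(t) * p_target + c_fa * P_fa(t) * (1 - p_target)
    over all decision thresholds t (assumes both score sets are non-empty)."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)), np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]          # sort labels by ascending score

    n_tar = labels.sum()
    n_non = len(labels) - n_tar
    # Threshold positions 0..N: rejecting the k lowest-scoring trials turns
    # targets among them into misses and leaves the remaining non-targets
    # above the threshold as false alarms.
    misses = np.concatenate([[0.0], np.cumsum(labels)])
    false_alarms = n_non - np.concatenate([[0.0], np.cumsum(1.0 - labels)])

    dcf = c_miss * (misses / n_tar) * p_target + c_fa * (false_alarms / n_non) * (1 - p_target)
    return dcf.min()
```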
Highlights
In a spoken language recognition (SLR) task, an automatic system is used to infer the language identity of a given acoustic signal (Muthusamy et al., 1994)
Inspired by the use of different tokenisers trained on multiple languages (D’Haro et al., 2014), we extend our previous work on this topic (Ng et al., 2016a), in which unsupervised deep neural network (DNN) adaptation was used in the tokenisers of a phonotactic SLR system
The phonotactic SLR system in this study models term frequency–inverse document frequency (TF-IDF) features derived from the n-gram statistics of the one-best DNN tokeniser output
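The last highlight refers to TF-IDF features over phone n-grams; the sketch below illustrates one plausible way such features could be computed from one-best tokeniser output, under the assumption of a simple count-based TF and log IDF, with a hypothetical function and input format that are not taken from the paper.

```python
import math
from collections import Counter

def phone_ngram_tfidf(utterances, n=2):
    """TF-IDF vectors built from n-gram counts of one-best phone strings.
    `utterances` is a list of utterances, each a list of phone labels
    (a hypothetical input format chosen for illustration)."""
    # Term frequency: n-gram counts normalised within each utterance.
    tf_per_utt = []
    for phones in utterances:
        grams = Counter(zip(*[phones[i:] for i in range(n)]))
        total = sum(grams.values()) or 1
        tf_per_utt.append({g: c / total for g, c in grams.items()})

    # Inverse document frequency over the whole collection of utterances.
    num_docs = len(tf_per_utt)
    doc_freq = Counter(g for tf in tf_per_utt for g in tf)
    idf = {g: math.log(num_docs / df) for g, df in doc_freq.items()}

    return [{g: tf_val * idf[g] for g, tf_val in tf.items()} for tf in tf_per_utt]

# Toy usage: two utterances tokenised into phone labels.
vectors = phone_ngram_tfidf([["sil", "hh", "ah", "l", "ow"],
                             ["sil", "w", "er", "l", "d"]], n=2)
```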
Summary
In a spoken language recognition (SLR) task, an automatic system is used to infer the language identity of a given acoustic signal (Muthusamy et al., 1994). Regardless of the quality of the tokeniser, when the tokeniser is applied to multilingual data, the occurrence patterns of the output tokens differ significantly from one language to another. This allows for modelling and language classification (Zissman, 1996; Singer et al., 2003; Hazen and Zue, 1997; Navratil, 2006; Glembek et al., 2008).

This study compares the use of multiple tokenisers derived from different multilingual data through unsupervised training and adaptation to capture diverse linguistic information for SLR modelling. The phoneme recognition performance of the newly trained/adapted speech recognisers does not need to be optimal; they only serve to provide different representations of the acoustic data in terms of its tokenisation and to allow complementary effects to appear in the late SLR system fusion. In the work reported here, the method has been further extended to a state-of-the-art bottleneck i-vector SLR system, and different ways of training and adapting the DNN tokenisers have been investigated
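The score fusion mentioned above can be illustrated with a simple weighted sum of per-language scores from several SLR systems; the paper's actual fusion and calibration procedure is not described here, so the function below, its equal default weights, and the toy dimensions are assumptions for illustration only.

```python
import numpy as np

def fuse_scores(system_scores, weights=None):
    """Weighted linear fusion of per-language scores from several SLR systems.
    `system_scores` is a list of arrays, each of shape (num_trials, num_languages);
    equal weights are a simplifying assumption, whereas in practice fusion weights
    are typically trained (e.g. by logistic regression) on development data."""
    stacked = np.stack(system_scores)                    # (num_systems, trials, languages)
    if weights is None:
        weights = np.full(len(system_scores), 1.0 / len(system_scores))
    return np.tensordot(weights, stacked, axes=1)        # weighted sum over systems

# Toy usage: three systems built on tokenisers adapted to different languages.
scores = [np.random.randn(100, 14) for _ in range(3)]    # 14 target languages (illustrative)
predicted_language = fuse_scores(scores).argmax(axis=1)
```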