Abstract

Phone tokenisers are used in spoken language recognition (SLR) to obtain elementary phonetic information. We present a study on the use of deep neural network tokenisers. Unsupervised crosslingual adaptation was performed to adapt the baseline tokeniser trained on English conversational telephone speech data to different languages. Two training and adaptation approaches, namely cross-entropy adaptation and state-level minimum Bayes risk adaptation, were tested in a bottleneck i-vector and a phonotactic SLR system. The SLR systems using the tokenisers adapted to different languages were combined using score fusion, giving a 7–18% reduction in minimum detection cost function (minDCF) compared with the baseline configurations without adapted tokenisers. Analysis of results showed that the ensemble tokenisers gave diverse representations of phonemes, thus bringing complementary effects when SLR systems with different tokenisers were combined. SLR performance was also shown to be related to the quality of the adapted tokenisers.
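
The adaptation described above is unsupervised: the baseline tokeniser decodes untranscribed target-language speech and is then retrained on its own one-best output. Below is a minimal PyTorch-style sketch of the cross-entropy variant under that assumption; the model interface, data handling, and function names are illustrative, not the authors' implementation (which additionally includes a state-level minimum Bayes risk variant).

```python
import torch
import torch.nn.functional as F

def adapt_cross_entropy(dnn, target_lang_frames, epochs=1, lr=1e-4):
    """Unsupervised cross-entropy adaptation sketch (hypothetical API).

    Frame-level posteriors from the unadapted English DNN are converted
    to one-best pseudo-labels on untranscribed target-language speech;
    the same network is then fine-tuned on those labels.
    """
    dnn.eval()
    with torch.no_grad():
        # Pseudo-labels: one-best output unit per frame from the baseline model.
        pseudo_labels = dnn(target_lang_frames).argmax(dim=-1)

    optimiser = torch.optim.SGD(dnn.parameters(), lr=lr)
    dnn.train()
    for _ in range(epochs):
        optimiser.zero_grad()
        logits = dnn(target_lang_frames)              # (frames, output units)
        loss = F.cross_entropy(logits, pseudo_labels)
        loss.backward()
        optimiser.step()
    return dnn
```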

Highlights

  • In a spoken language recognition (SLR) task, an automatic system is used to infer the language identity of the given acoustic signal (Muthusamy et al., 1994).

  • Inspired by the use of different tokenisers trained on multiple languages (D'Haro et al., 2014), we extend our previous work on this topic (Ng et al., 2016a), where unsupervised deep neural network (DNN) adaptation was used in the tokenisers for a phonotactic SLR system.

  • The phonotactic SLR system in this study models term frequency-inverse document frequency (TF-IDF) features derived from the n-gram statistics of the one-best DNN tokeniser output (a toy sketch follows below).
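
For illustration, the sketch below derives TF-IDF weights from phone bigram counts of one-best tokeniser output, treating each utterance as a document. The helper names and the choice of bigrams are assumptions made for this toy example; the exact n-gram orders and weighting used in the paper may differ.

```python
from collections import Counter
import math

def phone_ngrams(phones, n=2):
    """Return n-gram tuples from a one-best phone sequence."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]

def tfidf_features(utterance_phone_seqs, n=2):
    """TF-IDF over phone n-grams; each utterance is treated as a document."""
    docs = [Counter(phone_ngrams(seq, n)) for seq in utterance_phone_seqs]
    num_docs = len(docs)
    df = Counter()                      # document frequency of each n-gram
    for counts in docs:
        df.update(counts.keys())
    features = []
    for counts in docs:
        total = sum(counts.values())
        features.append({
            gram: (c / total) * math.log(num_docs / df[gram])
            for gram, c in counts.items()
        })
    return features

# Toy usage: bigram TF-IDF from two decoded utterances (made-up phone strings).
feats = tfidf_features([["ae", "b", "ae", "t"], ["s", "ih", "t"]], n=2)
```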


Summary

Introduction

In a spoken language recognition (SLR) task, an automatic system is used to infer the language identity of the given acoustic signal (Muthusamy et al., 1994). Regardless of the quality of the tokeniser, when it is applied to multilingual data, the occurrence patterns of the output tokens differ significantly from one language to another. This allows for modelling and language classification (Zissman, 1996; Singer et al., 2003; Hazen and Zue, 1997; Navratil, 2006; Glembek et al., 2008).

This study compares the use of multiple tokenisers derived from different multilingual data through unsupervised training and adaptation, to capture diverse linguistic information for SLR modelling. The phoneme recognition performance of the newly trained/adapted speech recognisers does not need to be optimal; they serve only to provide different representations of the acoustic data in terms of its tokenisation and allow complementary effects to appear in the late SLR system fusion. In the work reported here, the method has been further extended to the state-of-the-art bottleneck i-vector SLR system, and different ways of training and adapting the DNN tokenisers have been investigated.
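
Because complementarity is exploited at the score level, the individual systems are combined by fusing their per-language scores. The sketch below shows one common approach, linear logistic-regression fusion trained on a development set; this is an illustrative assumption and not necessarily the calibration backend used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fuse_scores(dev_scores_per_system, dev_labels, eval_scores_per_system):
    """Linear score fusion sketch: concatenate the per-system score matrices
    (utterances x languages) and train a logistic regression on development
    labels, then produce fused per-language posteriors for evaluation data."""
    dev_x = np.hstack(dev_scores_per_system)    # (n_dev, K * n_langs)
    eval_x = np.hstack(eval_scores_per_system)  # (n_eval, K * n_langs)
    fuser = LogisticRegression(max_iter=1000)
    fuser.fit(dev_x, dev_labels)
    return fuser.predict_proba(eval_x)
```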

Unsupervised adaptation of speech tokeniser
DNN adaptation with cross-entropy training
DNN adaptation by uncertainty reweighting
Comparison of adaptation strategies
DNN tokenisers
DNN adaptation
Language recognition system
Score calibration and fusion
Language recognition results
Phonotactic LR system results
Bottleneck i-vector system results
Overall system fusion
Analysis on adapted DNN outputs
Findings
Conclusion
