Abstract
With the advance of semiconductor technology and the growing popularity of the distributed speech/speaker recognition paradigm (e.g., Siri on the iPhone 4S), we revisit the use of discrete models in automatic speech recognition (ASR) and speaker verification (SV) tasks. Compared with the dominant continuous-density model, a discrete model has inherently attractive properties: its output distributions are non-parametric, so evaluating a probability is an O(1) table lookup; furthermore, the features used in a discrete model can be encoded in fewer bits than those of a continuous model, lowering the bandwidth requirement in a distributed speech/speaker recognition architecture. Unfortunately, the recognition performance of a conventional discrete model is significantly worse than that of a continuous one, due to its large quantization error and its use of multiple independent streams.

In this thesis, we propose to reduce the quantization error of a discrete model by using a very large codebook with tens of thousands of codewords, roughly a hundred times larger than the 256 to 1024 codewords of a conventional discrete model. Accordingly, the number of parameters specifying a discrete output distribution grows by a hundred times. Building such a very large codebook model raises two major challenges. First, given a continuous acoustic feature vector, how do we quickly find its codeword in a codebook that is a hundred times larger? Second, given a limited amount of training data, how can we robustly train such a high-density model, which has a hundred times more parameters than a conventional one?

To find the codeword for an acoustic vector quickly, we employ subvector-quantized (SVQ) codebooks. SVQ codebooks represent a very large codebook in the full feature space as the combinatorial product of smaller per-subvector codebooks, so finding a full-space codeword reduces to finding a set of SVQ codewords, which is very fast.

To train such a high-density model robustly, two techniques are explored. The first is model conversion: a discrete model is derived directly from a well-trained continuous model, avoiding direct training on the data. The second is subspace modeling: the original high-density discrete distribution table is treated as a high-dimensional vector assumed to lie in a low-dimensional subspace. This subspace representation reduces the number of free parameters by ten to a hundred fold, so the model can be trained robustly on the limited amount of data.

Experimental evaluations on both ASR and SV tasks show the feasibility and benefits of the very large codebook discrete model. On the WSJ0 (Wall Street Journal) ASR task, the proposed model achieves recognition accuracy comparable to that of the continuous model, with much faster decoding and a lower bandwidth requirement. On the NIST (National Institute of Standards and Technology) 2002 SV task, an 8- to 25-fold speedup is achieved with almost no loss in verification performance.
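The SVQ lookup described above can be made concrete with a short sketch. In the Python snippet below, the function names `svq_quantize` and `state_likelihood` and the data layout are illustrative assumptions, not code from the thesis; the sketch shows how splitting the feature vector into subvectors turns one search over a huge full-space codebook into a few searches over small per-subvector codebooks, and how a discrete state likelihood then becomes a product of O(1) table lookups over independent streams.

```python
import numpy as np

def svq_quantize(x, codebooks):
    """Map a continuous feature vector to a tuple of SVQ codeword indices.

    codebooks: list of (K_s, d_s) arrays; the s-th array holds the K_s
    codewords for the s-th subvector (of dimension d_s). The index tuple
    identifies one codeword in the implicit full-space codebook of size
    prod(K_s), but the search cost is only sum(K_s) distance computations.
    """
    indices = []
    start = 0
    for cb in codebooks:
        d = cb.shape[1]
        sub = x[start:start + d]
        # Nearest codeword in this small per-subvector codebook (Euclidean).
        indices.append(int(np.argmin(np.sum((cb - sub) ** 2, axis=1))))
        start += d
    return tuple(indices)

def state_likelihood(indices, tables):
    """Discrete output probability for one HMM state: a product of O(1)
    table lookups, one per independent subvector stream."""
    p = 1.0
    for idx, table in zip(indices, tables):
        p *= table[idx]
    return p

# Toy example: a 6-dim vector split into three 2-dim subvectors with 4
# codewords each gives 4**3 = 64 effective full-space codewords from only
# 3 * 4 = 12 distance computations.
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((4, 2)) for _ in range(3)]
tables = [np.full(4, 0.25) for _ in range(3)]   # uniform toy distributions
idx = svq_quantize(rng.standard_normal(6), codebooks)
print(idx, state_likelihood(idx, tables))
```

The sizes here are toys; with, say, three subvectors of 32 codewords each, the implicit full-space codebook would have 32³ = 32,768 entries, on the order of the tens of thousands of codewords the thesis targets, while quantization would cost only 96 distance computations.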
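The subspace-modeling idea can be sketched in the same spirit. In the toy parameterization below, the softmax form, the shared basis `M`, and the state coordinate `v` are assumptions for illustration, not necessarily the thesis's exact formulation; it shows how reconstructing each state's K-entry discrete distribution from a D-dimensional coordinate with D ≪ K is what cuts the free-parameter count by ten to a hundred fold.

```python
import numpy as np

def subspace_distribution(M, v, m=None):
    """Reconstruct one state's discrete output distribution from its
    low-dimensional subspace coordinates.

    M : (K, D) basis matrix shared by all states (K codewords, D << K)
    v : (D,)   state-specific coordinate vector
    m : (K,)   optional shared offset, e.g. a global log-distribution

    Each state stores only the D entries of v instead of K - 1 free
    probabilities, cutting state-specific parameters by roughly K / D.
    """
    logits = M @ v if m is None else M @ v + m
    logits -= logits.max()      # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()          # normalize into a valid distribution

# Example: a 32,768-entry distribution from a 50-dim coordinate vector,
# i.e. ~650x fewer state-specific parameters (sizes are illustrative).
rng = np.random.default_rng(1)
K, D = 32768, 50
M = rng.standard_normal((K, D)) * 0.1
v = rng.standard_normal(D)
p = subspace_distribution(M, v)
print(p.shape, p.sum())         # (32768,) 1.0
```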