Abstract

In this paper, we study various front-end features, modeling and adaptation algorithms on the Aurora 3 databases, including auditory, moment, and AM-FM modulation features, context-dependent digit models, segmental K-means training, discriminative training, and model adaptations. The evaluation results on Aurora 3 are presented with a brief summary of our Aurora 2 results.

1. INTRODUCTION

The Aurora evaluation allows researchers to test their noise-robustness algorithms and compare results measured on the same databases. So far, the Aurora evaluation comprises two tasks, Aurora 2 and Aurora 3, both of which are connected-digit recognition tasks. While the Aurora 2 databases use controlled experiments in which noise is added digitally to clean English digit strings [1], the Aurora 3 databases are collected in a real-world car environment in four languages. In this paper, we report our evaluation results on two of these languages, Spanish and German.

2. BELL LABS APPROACHES

In this section, we present our baseline system and then describe the different feature sets that have been used for this evaluation. Alternative training strategies and acoustic model adaptation techniques are also reviewed.

A. Context-Dependent Model: Similar to last year's approach [1], we decided to use context-dependent (CD) digit models, together with the Bell Labs recognition engine as the backend. This contrasts with the official Aurora backend, which is based on whole-word digit models and the HTK engine. The official backend setup typically leads to poorer results, especially on larger databases, and we believe that a better baseline is beneficial to properly study the effect of different front-ends on the final recognition performance. Last year, we investigated several approaches to build CD digit models.
Given the limited amount of training data, especially in the Aurora 3 databases, it is necessary to rely on tying techniques to build CD digit models. The Head-Body-Tail (HBT) digit model structure assumes that CD digit models are built by concatenating a left-context-dependent unit (head) with a context-independent unit (body), followed by a right-context-dependent unit (tail). For example, assuming that the lexicon contains 10 digits plus a silence model, each digit model consists of a set of 1 body, 11 heads, and 11 tails (representing all left/right contexts) [2]. We typically model each head and tail with a 3-state HMM, while a 4-state HMM is used for each body. Most of the experiments done this year have been based on the HBT structure. CD digit models can also be built as triphone models using a decision tree. This is the approach we introduced last year [1], and some of this year's experiments have been carried out using this model topology.

B. Auditory Feature: The new auditory front-end in our recognition system was developed to mimic robust human hearing in adverse acoustic environments [3, 4]. In the front-end, efficient signal processing functions were implemented to satisfy both real-time and computation cost requirements. Based on an analysis of the outer and middle ear, a transfer function was constructed to replace the commonly used preemphasis filter, and a new set of digital auditory filters, which simulate auditory filtering in the cochlea, replaces those used in MFCC and PLP. The auditory feature extraction procedure consists of: an outer-middle-ear transfer function, FFT, frequency conversion from the linear to the Bark scale, auditory filtering, nonlinearity, and discrete cosine transform (DCT). In our previous study [3], the feature was evaluated on two tasks: connected-digit and large-vocabulary continuous speech recognition under various noise conditions, using both handset and hands-free data in landline and wireless transmission with additive car and babble noise.
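The extraction chain described above (FFT, Bark-scale conversion, auditory filtering, nonlinearity, DCT) can be sketched in simplified form. This is our own minimal illustration, not the paper's implementation: it omits the outer-middle-ear transfer function, uses a naive DFT in place of a real FFT, stands in generic triangular Bark-spaced filters for the cochlear auditory filters, and uses a log nonlinearity. All function names and parameter defaults below are ours.

```python
import math

def bark(f_hz):
    """Hz-to-Bark conversion (one common variant of the formula)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def power_spectrum(frame):
    """Naive DFT power spectrum; stands in for the FFT step."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        spec.append(re * re + im * im)
    return spec

def bark_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters evenly spaced on the Bark scale
    (a generic stand-in for the paper's cochlear auditory filters)."""
    max_bark = bark(sample_rate / 2.0)
    centers = [max_bark * (j + 1) / (n_filters + 1) for j in range(n_filters)]
    width = max_bark / (n_filters + 1)
    bank = []
    for c in centers:
        weights = []
        for b in range(n_bins):
            f = b * sample_rate / (2.0 * (n_bins - 1))
            weights.append(max(0.0, 1.0 - abs(bark(f) - c) / width))
        bank.append(weights)
    return bank

def dct2(values, n_coeffs):
    """DCT-II, keeping the first n_coeffs coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(n_coeffs)]

def auditory_features(frame, sample_rate=8000, n_filters=8, n_coeffs=6):
    """Sketch of the chain: power spectrum -> Bark filtering -> log -> DCT."""
    spec = power_spectrum(frame)
    bank = bark_filterbank(n_filters, len(spec), sample_rate)
    energies = [math.log(max(sum(w * s for w, s in zip(f, spec)), 1e-10))
                for f in bank]
    return dct2(energies, n_coeffs)
```

For a 64-sample frame of a 440 Hz tone at 8 kHz, `auditory_features(frame)` returns 6 cepstral-style coefficients; a real front-end would add windowing, the ear transfer function, and frame-by-frame processing.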
Compared with the LPCC, MFCC, MEL-LPCC, and PLP features, the auditory feature achieved significant performance improvements.
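To make the HBT inventory from Section A concrete, the unit counts can be enumerated directly: 10 digits plus silence give 11 possible left/right contexts, so each digit gets 1 body, 11 heads, and 11 tails, with the 3-state head/tail and 4-state body HMMs described above. This is our own illustrative enumeration; the unit naming scheme is hypothetical, not the paper's.

```python
DIGITS = ["zero", "one", "two", "three", "four", "five",
          "six", "seven", "eight", "nine"]
CONTEXTS = DIGITS + ["sil"]  # 11 possible left/right contexts

def hbt_inventory():
    """Return (unit_name, n_hmm_states) for every head, body, and tail unit."""
    units = []
    for d in DIGITS:
        units.append((f"{d}.body", 4))              # context-independent 4-state body
        for c in CONTEXTS:
            units.append((f"{c}-{d}.head", 3))      # left-context-dependent 3-state head
            units.append((f"{d}+{c}.tail", 3))      # right-context-dependent 3-state tail
    return units
```

Each digit thus contributes 1 + 11 + 11 = 23 units, or 230 units for the full 10-digit lexicon, which is the tying that makes CD digit models trainable on the limited Aurora 3 data.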
