Abstract

This paper proposes recognition methods that use hierarchical spectral dynamic features extracted over multiple time lengths, and shows the effectiveness of these features in phoneme recognition and isolated word recognition. The dynamic features are extracted from time sequences of both cepstral coefficients and logarithmic energy, and are combined with instantaneous cepstral coefficients. Input speech is quantized by word- or phoneme-specific codebooks, each created by clustering a set of vectors consisting of these features for the corresponding word or phoneme.

Speaker-independent isolated word recognition experiments are performed using a vocabulary of 100 Japanese words. When VQ distortion is used for word identification, a high recognition accuracy of 96 percent is achieved. When VQ distortion is used for preprocessing, the number of word candidates for each input utterance is reduced to 1 percent of the vocabulary without increasing the error rate. Phoneme recognition experiments are performed for the consonants /b/, /d/ and /g/ in a large vocabulary of isolated words uttered by one male speaker. The proposed recognition method achieves a high recognition accuracy of 99 percent, which is similar to the accuracies obtained by conventional HMM or neural network approaches. This paper also compares multiple-codebook and single-codebook methods.
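The recognition scheme summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the dynamic features are computed or how the codebooks are clustered, so this sketch assumes regression slopes over windows of several lengths as the dynamic features, and a plain k-means-style clustering with squared Euclidean distance for the codebooks. All function names, window half-widths, codebook sizes, and the toy data are hypothetical.

```python
def delta(seq, half_width):
    """Regression slope of each coefficient over a (2*half_width + 1)-frame window."""
    n = len(seq)
    norm = 2 * sum(k * k for k in range(1, half_width + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(len(seq[0])):
            s = 0.0
            for k in range(-half_width, half_width + 1):
                idx = min(max(t + k, 0), n - 1)  # repeat edge frames at the boundaries
                s += k * seq[idx][d]
            row.append(s / norm)
        out.append(row)
    return out


def hierarchical_features(cepstra, log_energy, half_widths=(1, 3)):
    """Instantaneous cepstra plus dynamic features at several time lengths (assumed widths)."""
    energy = [[e] for e in log_energy]
    feats = [list(c) for c in cepstra]
    for w in half_widths:
        for dyn in (delta(cepstra, w), delta(energy, w)):
            for row, extra in zip(feats, dyn):
                row.extend(extra)
    return feats


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_codebook(vectors, size, iters=10):
    """Cluster training vectors into `size` codewords (simple k-means, an assumption)."""
    step = max(len(vectors) // size, 1)
    codebook = [list(v) for v in vectors[::step][:size]]  # deterministic initialization
    for _ in range(iters):
        cells = [[] for _ in codebook]
        for v in vectors:
            j = min(range(len(codebook)), key=lambda i: sq_dist(v, codebook[i]))
            cells[j].append(v)
        for j, cell in enumerate(cells):
            if cell:  # keep the old codeword if its cell is empty
                codebook[j] = [sum(col) / len(cell) for col in zip(*cell)]
    return codebook


def vq_distortion(frames, codebook):
    """Average distance from each frame to its nearest codeword."""
    return sum(min(sq_dist(f, c) for c in codebook) for f in frames) / len(frames)


def recognize(frames, codebooks):
    """Pick the word (or phoneme) whose codebook gives the smallest VQ distortion."""
    return min(codebooks, key=lambda label: vq_distortion(frames, codebooks[label]))


def toy_utterance(slope, n=12):
    """Synthetic 2-coefficient 'cepstra' and log energy, for illustration only."""
    cep = [[slope * t, 0.5] for t in range(n)]
    eng = [0.1 * t for t in range(n)]
    return hierarchical_features(cep, eng)


codebooks = {
    "rising": train_codebook(toy_utterance(+1.0), size=4),
    "falling": train_codebook(toy_utterance(-1.0), size=4),
}
print(recognize(toy_utterance(+0.9), codebooks))
```

For the preprocessing use mentioned in the abstract, the same `vq_distortion` scores could be sorted and only the lowest-scoring labels passed on to a more detailed matcher, rather than taking the single minimum.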