Hidden Markov models (HMMs) have recently been applied successfully to both isolated and connected word recognition. However, when the same formulation is adopted for more confusable vocabularies, such as the letters of the English alphabet, recognition performance is often less satisfactory. The main reason is that a more accurate model is required. Such a model should be more robust to small training sample sizes and should be properly initialized so that the finer features needed to discriminate confusable word pairs can be extracted. In this paper, three specific robustness issues are investigated: choice of observation densities, model initialization, and incorporation of duration information. In a step-by-step attempt to address these issues, it was found that the same HMM formulation can still be adopted if acoustic/phonetic knowledge about the vocabulary is taken into account in the model parameter estimation and recognition phases. Testing on a 39-word English alpha-digit vocabulary in a speaker-trained mode indicates that recognition performance can be significantly improved, and that the results are comparable to those of a template-based DTW recognizer, if model parameters are properly initialized and durational information is adequately incorporated.