Abstract

In today's society, speech recognition supports a variety of functions, such as executing voice commands, enabling speech processing and spoken language translation, and facilitating communication, so the study of speech recognition technology is of high value. However, current speech recognition techniques focus on clearly articulated speech, which poses great challenges for recognizing colloquial or dialectal pronunciation. Some scholars build speech recognition systems with a model combining time-delay neural networks and long short-term memory (LSTM) networks, but its acoustic recognition performance is poor. Therefore, by analyzing deep neural networks, the study proposes a composite English speech recognition model combining a convolutional neural network (CNN), a time-delay neural network (TDNN), and an output-gate projected gated recurrent unit (OPGRU). Introducing the CNN optimizes the acoustic model, allowing it to recognize pronunciation features accurately and giving the model a wider recognition range. The proposed composite model is compared with the TDNN-OPGRU model on word error rate (WER) and runtime on the Mozilla Common Voice dataset. The composite model achieves a WER of 23.42% with a running time of 1418 s, while the TDNN-OPGRU model achieves a WER of 24.61% with a running time of 1385 s. Compared with the TDNN-OPGRU model, the WER of the composite model thus decreases by 1.19 percentage points while the running time increases by 33 s, meaning the composite model is more accurate. Since recognition accuracy takes priority over running time, the composite model proposed in the study offers better overall performance.
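The abstract's description of the composite acoustic model (a CNN front end feeding TDNN layers, followed by a projected recurrent unit) can be illustrated with a minimal sketch. The layer sizes, kernel widths, pooling choices, feature dimensions, and the plain GRU-plus-projection used as a stand-in for the OPGRU below are all assumptions made for illustration, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class CompositeAcousticModel(nn.Module):
    """Hypothetical CNN + TDNN + projected-GRU acoustic model.

    All dimensions and the number of output targets are illustrative
    assumptions; they are not taken from the paper.
    """
    def __init__(self, n_mels=40, tdnn_dim=512, gru_dim=512,
                 proj_dim=256, n_targets=3000):
        super().__init__()
        # CNN front end: 2-D convolutions over (time, frequency) that
        # extract local spectral patterns before the TDNN layers.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),   # pool along frequency only
        )
        cnn_out = 32 * (n_mels // 2)
        # TDNN block: 1-D convolutions over time with increasing dilation,
        # giving a wide temporal context at modest cost.
        self.tdnn = nn.Sequential(
            nn.Conv1d(cnn_out, tdnn_dim, 3, dilation=1), nn.ReLU(), nn.BatchNorm1d(tdnn_dim),
            nn.Conv1d(tdnn_dim, tdnn_dim, 3, dilation=2), nn.ReLU(), nn.BatchNorm1d(tdnn_dim),
            nn.Conv1d(tdnn_dim, tdnn_dim, 3, dilation=3), nn.ReLU(), nn.BatchNorm1d(tdnn_dim),
        )
        # Recurrent block: a standard GRU followed by a linear projection,
        # standing in here for the paper's output-gate projected GRU (OPGRU).
        self.gru = nn.GRU(tdnn_dim, gru_dim, batch_first=True)
        self.proj = nn.Linear(gru_dim, proj_dim)
        self.out = nn.Linear(proj_dim, n_targets)

    def forward(self, feats):                     # feats: (batch, time, n_mels)
        x = self.cnn(feats.unsqueeze(1))          # (batch, 32, time, n_mels/2)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)     # flatten channels*freq
        x = self.tdnn(x.transpose(1, 2)).transpose(1, 2)   # TDNN over the time axis
        x, _ = self.gru(x)
        return self.out(torch.relu(self.proj(x)))  # frame-level target scores
```

Under these assumptions, passing a `(batch, time, 40)` tensor of log-mel features through the model yields frame-level acoustic scores that a downstream decoder could consume, which mirrors the abstract's point that the CNN stage refines the acoustic model before the TDNN-OPGRU stages.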
