Abstract

Understanding human speech precisely by a machine has been a major challenge for many years. Automatic Speech Recognition (ASR) is decades old, and although the technology has not reached the point where machines understand all speech, it is used on a regular basis in many applications and services. Hence, to advance research, it is important to identify significant research directions, specifically those that have not been pursued or funded in the past. The performance of ASR systems, traditionally built upon a Hidden Markov Model (HMM), has improved due to the application of Deep Neural Networks (DNNs). Despite this progress, building an ASR system remains a challenging task requiring multiple resources and training stages. The use of DNNs for ASR has progressed from being a single component in a pipeline to building a system based mainly on such a network. This paper provides a literature survey of state-of-the-art research on two major models, namely the Deep Neural Network - Hidden Markov Model (DNN-HMM) and Recurrent Neural Networks trained with Connectionist Temporal Classification (RNN-CTC). It also describes the differences between these two models at the architectural level.

Highlights

  • The technology of Automatic Speech Recognition (ASR) enables a system to recognize human speech and produce the output

  • This paper provides a literature survey of state-of-the-art research on two major models, namely the Deep Neural Network - Hidden Markov Model (DNN-HMM) and Recurrent Neural Networks trained with Connectionist Temporal Classification (RNN-CTC)

  • While the baseline system improves only slightly when moving from the "14 Hr" to the "81 Hr" training set, the error rate of the RNN declines sharply


Summary

INTRODUCTION

The technology of Automatic Speech Recognition (ASR) enables a system to recognize human speech and produce the output. The system takes a speech waveform, which represents the words of the sentence as well as any vocalized pauses in the spoken input. The system decodes the speech, providing the best-fitting sentence. It converts the speech signal into a sequence of vectors measured over the duration of the speech signal. Currently available ASR systems do not need a long period of speech training and can successfully recognize continuous speech with a large vocabulary at a high accuracy rate.
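The conversion of a speech signal into a sequence of vectors can be illustrated with a minimal sketch. It is not the front end of any system surveyed here: the 25 ms window, the 10 ms hop, and the two toy features (log energy and zero-crossing rate) are assumptions chosen for illustration, and the function names are hypothetical. Practical systems typically use richer features such as filterbank or cepstral coefficients.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a waveform into overlapping, fixed-length frames.

    The 25 ms window and 10 ms hop are assumed defaults, not values
    taken from the paper.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    frames = []
    start = 0
    # Keep only frames that fit entirely inside the signal.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


def frame_features(frame):
    """Toy 2-dimensional feature vector: [log energy, zero-crossing rate]."""
    energy = sum(x * x for x in frame) / len(frame)
    log_energy = math.log(energy + 1e-10)  # small offset avoids log(0)
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
    )
    zcr = crossings / max(len(frame) - 1, 1)
    return [log_energy, zcr]


def waveform_to_vectors(samples, sample_rate):
    """Map a waveform to one feature vector per frame."""
    return [frame_features(f) for f in frame_signal(samples, sample_rate)]


# Synthetic input: one second of a 440 Hz sine wave sampled at 16 kHz.
sample_rate = 16000
samples = [
    math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)
]
vectors = waveform_to_vectors(samples, sample_rate)
print(len(vectors), len(vectors[0]))
```

The length of the resulting sequence grows with the duration of the signal, which is why the recognizer must map a variable-length vector sequence to a word sequence.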

Infrastructure
Models and Algorithms
Search
Metadata
HIDDEN MARKOV MODEL SYSTEMS
Deep Neural Network Hybrids
Training of a DNN-HMM Based System
RECURRENT NEURAL NETWORK SYSTEMS
Recurrent Neural Networks
Decoding
Decoding with a Language Model
EXPERIMENTS
CONCLUSION