Speech is a primary means of human communication, and it is becoming a common, if not standard, interface to technology, as the trend of technological change over recent years shows. Voice is increasingly used to control programs, appliances, and personal devices in homes, cars, workplaces, and public spaces, through smartphones and home assistant devices built on Amazon's Alexa, Google's Assistant, Apple's Siri, and other proliferating technologies. This is made possible by Automatic Speech Recognition (ASR), the process of transcribing spoken utterances into text. ASR enables machines to respond correctly and reliably to human voices and to provide useful services. Because communicating with a computer by voice is quicker than typing on a keyboard, people are likely to prefer such systems, and because communication among people is dominated by speech, it is natural for them to expect voice interfaces to computers. This can be accomplished with speech-to-text systems, which allow a computer to convert voice requests and dictation into text. A traditional ASR system consists of three models: an acoustic model, a language model, and a lexicon model. The challenges in ASR include varying styles of speech, environmental conditions such as background noise, and speaker accents. To mitigate these challenges, deep learning models are used. The main idea is to analyze features of the input audio signal, such as spectrograms and Mel-frequency cepstral coefficients (MFCCs), and to develop state-of-the-art deep learning models on top of them. The proposed end-to-end model achieved an error rate of 0.60 on the LibriSpeech dataset.
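
The abstract does not state which error rate is reported. Word error rate (WER) is a common metric for ASR systems, so the sketch below assumes WER purely for illustration: it counts the word-level substitutions, deletions, and insertions needed to turn the hypothesis transcript into the reference, then divides by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions, not details from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()   # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + substitution_cost,   # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a perfect transcript scores 0.0, and the score can exceed 1.0 when the hypothesis contains many inserted words.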