Abstract
Automatic speech recognition is generally studied for two types of word utterances: isolated words and continuous speech. Continuous speech is the more natural way of speaking, but it is harder for machines (speech recognizers) to recognize and is highly sensitive to environmental variations. Several parameters directly affect the performance of automatic speech recognition, such as the size of the dataset/corpus, the type of data (isolated, spontaneous, or continuous), and the environmental conditions (noisy or clean). Speech recognizers generally perform well on isolated words in clean environments, but performance degrades in noisy environments, especially for continuous words and sentences, and this remains a challenge. In this paper, a hybrid feature extraction technique is proposed that joins core blocks of perceptual linear prediction (PLP) and Mel-frequency cepstral coefficients (MFCC) to improve the performance of speech recognizers under such conditions. A voice activity detection (VAD)-based frame-dropping scheme is used only in the training part of the automatic speech recognition (ASR) procedure, so it is not needed in actual deployments. The motivation for this scheme is to remove pauses and distorted segments of speech, which further improves phoneme modeling. The proposed method shows an average performance improvement of 12.88% on standard datasets.
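As a rough illustration of the pipeline summarized above, the Python sketch below (assuming NumPy and librosa) concatenates MFCC frames with a PLP feature matrix supplied by an external front end and applies a simple energy-based VAD to drop non-speech frames during training only. The helper names `hybrid_features` and `vad_mask`, the frame sizes, and the energy threshold are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
import librosa

def vad_mask(signal, frame_length=400, hop_length=160, energy_ratio=0.1):
    """Energy-based VAD: True for frames treated as speech.

    A rough stand-in for the VAD frame-dropping step that the paper
    applies only during training to remove pauses and distorted segments.
    """
    frames = librosa.util.frame(signal, frame_length=frame_length,
                                hop_length=hop_length).T
    energy = np.sum(frames.astype(float) ** 2, axis=1)
    return energy > energy_ratio * energy.mean()

def hybrid_features(signal, sr, plp, n_mfcc=13,
                    frame_length=400, hop_length=160, training=True):
    """Concatenate MFCC frames with an externally computed PLP stream.

    `plp` is assumed to be an (n_frames, n_plp) array produced by any
    PLP front end (librosa itself provides no PLP routine).
    """
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_length,
                                hop_length=hop_length).T

    mask = vad_mask(signal, frame_length, hop_length)
    n = min(len(mfcc), len(plp), len(mask))

    # Frame-wise concatenation of the two streams ("joining core blocks").
    feats = np.hstack([mfcc[:n], plp[:n]])

    # Drop non-speech frames only when building training data;
    # at recognition time the full feature stream is kept.
    return feats[mask[:n]] if training else feats
```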