Abstract

Today, internet traffic is dominated by video data, which has led many researchers to develop audiovisual automatic speech recognition (AVSR) systems. AVSR has been shown to be more accurate and more robust to noise than audio-only automatic speech recognition (ASR). However, three issues arise in developing an AVSR system: finding an optimal combination of audio and visual features; the fact that acoustic models based on phonemes or graphemes/characters/letters are commonly not robust to noise; and the high complexity of feature extraction based on Mel-Frequency Cepstral Coefficients (MFCC) and Gaussian Mixture Models (GMM). This paper describes the development of a syllable-based Indonesian AVSR system (INAVSR) that fuses audio and visual features. The system is built with the Hidden Markov Model Toolkit (HTK), with visual features extracted using both the discrete cosine transform (DCT) and principal component analysis (PCA). A dataset of 43 recorded videos, with a resolution of $640 \times 360$ pixels, 25 frames per second, and an audio sample rate of 16 kHz, is also developed to evaluate the system. The dataset is split into two subsets: 28 videos for training and 15 videos for testing. The evaluation shows that the developed system reduces the word error rate (WER) of the audio-only ASR by up to 6.07% absolute.
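As a rough illustration of the visual front end described above, the sketch below combines a 2-D DCT of a cropped mouth region with PCA-based dimensionality reduction. The ROI size, the number of retained DCT coefficients, and the PCA dimensionality are illustrative assumptions rather than values reported in the paper, and the HTK-based modelling stage is not reproduced here.

```python
# A minimal sketch of DCT + PCA visual feature extraction, assuming the mouth
# region of interest (ROI) has already been cropped from each video frame.
# ROI size, retained-coefficient count, and PCA dimensionality are assumptions.
import numpy as np
from scipy.fftpack import dct
from sklearn.decomposition import PCA

def dct2(block):
    """2-D type-II DCT with orthonormal scaling."""
    return dct(dct(block, axis=0, norm="ortho"), axis=1, norm="ortho")

def dct_features(roi_frames, keep=8):
    """Keep the low-frequency keep x keep corner of the 2-D DCT per frame."""
    feats = []
    for frame in roi_frames:                  # each frame: 2-D grayscale array
        coeffs = dct2(frame.astype(np.float64))
        feats.append(coeffs[:keep, :keep].ravel())
    return np.stack(feats)                    # shape: (num_frames, keep * keep)

def reduce_with_pca(features, n_components=30):
    """Project the DCT features onto their leading principal components."""
    pca = PCA(n_components=n_components)
    return pca.fit_transform(features)

# Example with synthetic data: 100 frames of a 64x64 mouth ROI.
rng = np.random.default_rng(0)
frames = rng.random((100, 64, 64))
visual_feats = reduce_with_pca(dct_features(frames), n_components=30)
print(visual_feats.shape)                     # (100, 30)
```

In practice the resulting visual feature vectors would be synchronized with the audio (MFCC) stream and written in a format HTK can read before training the syllable-based acoustic models.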
