Efficient and robust automatic speech recognition (ASR) systems are in high demand. ASR systems are generally fed with cepstral features such as mel-frequency cepstral coefficients (MFCCs) or perceptual linear prediction (PLP). However, attempts have also been made to shift speech recognition toward simpler features, such as critical band energies or the spectrogram, using deep learning models. These approaches claim the ability to train directly on the raw signal. Such systems depend heavily on the discriminative power of ConvNet layers to separate two phonemes with nearly similar accents, yet they do not offer a high recognition rate. The main reason for the limited recognition rate is stride-based pooling, which sharply reduces output dimensionality, by at least 75%. Region-based convolutional neural networks (R-CNNs) and Fast R-CNN were proposed to improve performance, but their results did not meet the expected level. Therefore, a new pooling technique, multilevel region-of-interest (RoI) pooling, is proposed, which pools multilevel information from multiple ConvNet layers. The resulting architecture, named the multilevel RoI convolutional neural network (MR-CNN), is built by placing RoI pooling layers after up to four of the coarsest ConvNet layers, enriching the extracted features with additional information from the multilevel ConvNet layers. Its performance is evaluated on the TIMIT and Wall Street Journal (WSJ) datasets for phoneme recognition. On raw speech, the model achieves phoneme error rates of 16.4% on TIMIT and 17.1% on WSJ, slightly better than spectral features.
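
The abstract does not specify the exact layer configuration, so the following is only a minimal sketch of the multilevel pooling idea in PyTorch: a stack of 1-D convolutions over the raw waveform, with a fixed-size pooling head (standing in for RoI pooling over the utterance region) attached to each of the four coarsest stages, and the pooled outputs concatenated before classification. All names, channel counts, kernel sizes, and the number of pooling bins (MRCNNSketch, roi_bins, and so on) are illustrative assumptions, not the paper's actual architecture.

import torch
import torch.nn as nn

class MRCNNSketch(nn.Module):
    """Illustrative multilevel pooling over raw speech (hypothetical layout)."""
    def __init__(self, n_phonemes=39, roi_bins=4):
        super().__init__()
        # Four 1-D conv stages over the raw waveform; channel counts are assumptions.
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(c_in, c_out, kernel_size=5, stride=2, padding=2),
                nn.ReLU())
            for c_in, c_out in [(1, 32), (32, 64), (64, 128), (128, 128)]
        ])
        # A fixed-size pooling head shared across stages, mimicking RoI pooling
        # over the full utterance region (roi_bins output bins per channel).
        self.roi_pool = nn.AdaptiveMaxPool1d(roi_bins)
        pooled_dim = (32 + 64 + 128 + 128) * roi_bins
        self.classifier = nn.Linear(pooled_dim, n_phonemes)

    def forward(self, wav):                 # wav: (batch, 1, samples)
        pooled = []
        x = wav
        for stage in self.stages:
            x = stage(x)
            # Pool every stage, not just the last one: this is the
            # "multilevel" part of the idea.
            pooled.append(self.roi_pool(x).flatten(1))
        return self.classifier(torch.cat(pooled, dim=1))

model = MRCNNSketch()
logits = model(torch.randn(2, 1, 16000))    # two one-second clips at 16 kHz
print(logits.shape)                         # torch.Size([2, 39])

The design point this sketch tries to capture is that pooling from several depths at once preserves fine-grained information that a single stride-based reduction at the top of the network would discard.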