Digital speech recognition is a challenging problem that requires the ability to learn complex signal characteristics such as frequency, pitch, intensity, timbre, and melody, which traditional methods often face issues in recognizing. This article introduces three solutions based on convolutional neural networks (CNN) to solve the problem: 1D-CNN is designed to learn directly from digital data; 2DS-CNN and 2DM-CNN have a more complex architecture, transferring raw waveform into transformed images using Fourier transform to learn essential features. Experimental results on four large data sets, containing 30,000 samples for each, show that the three proposed models achieve superior performance compared to well-known models such as GoogLeNet and AlexNet, with the best accuracy of 95.87%, 99.65%, and 99.76%, respectively. With 5-10% higher performance than other models, the proposed solution has demonstrated the ability to effectively learn features, improve recognition accuracy and speed, and open up the potential for broad applications in virtual assistants, medical recording, and voice commands.
Read full abstract