Abstract

In this article we report on experiments in detecting vowel segments in speech with additive noise. Deep neural networks have become the key algorithm in the majority of modern machine learning solutions. We investigate the performance of four ImageNet convolutional neural network (CNN) architectures. The use of image-processing CNNs is enabled by transforming the speech segments into spectrograms before classification takes place. We perform experiments on the TIMIT speech dataset with noise from the MAVD and ESC-50 datasets. Accuracy did not vary significantly among the architectures on the dataset with added noise. However, the accuracy of the architectures did differ significantly when they were applied to noise without speech.
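
The preprocessing described above (adding noise to a speech segment, then converting the result into a spectrogram that an image CNN can consume) can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not state the signal-to-noise ratio, window length, hop size, or image layout, so `snr_db`, `frame_len`, `hop`, and the three-channel output are assumptions chosen for illustration, and the function names are hypothetical. A synthetic tone stands in for a TIMIT segment and white noise stands in for MAVD or ESC-50 recordings.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the speech-to-noise power ratio is snr_db."""
    noise = np.resize(noise, speech.shape)  # repeat or truncate noise to match length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def log_spectrogram(signal, frame_len=400, hop=160, eps=1e-10):
    """Log-magnitude short-time Fourier spectrogram, shape (freq_bins, frames)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return 20 * np.log10(magnitude + eps).T

def to_image(spec):
    """Min-max scale to [0, 1] and repeat over 3 channels (assumed CNN input layout)."""
    lo, hi = spec.min(), spec.max()
    scaled = (spec - lo) / (hi - lo + 1e-12)
    return np.repeat(scaled[None, :, :], 3, axis=0)

# Synthetic stand-ins: 1 s at 16 kHz (the TIMIT sampling rate).
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
speech = np.sin(2 * np.pi * 220 * t)   # placeholder for a vowel segment
noise = rng.normal(size=sr)            # placeholder for an environmental noise clip

mixed = mix_at_snr(speech, noise, snr_db=10)   # assumed SNR, not from the paper
image = to_image(log_spectrogram(mixed))
print(image.shape)
```

In practice the resulting array would still need resizing to the input resolution expected by the chosen CNN, a step that depends on the architecture and is omitted here.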
