Abstract
This article presents the development of a tool for recognizing a speaker's emotions based on neural-network analysis of fixed fragments of the voice signal. It is shown that recognition can be improved by using a convolutional neural network of the GoogLeNet type, and that training the network on examples from the TESS database is expedient. A procedure for representing a voice signal as a square grayscale image is developed, and the GoogLeNet network is modified on the basis of this database. The architecture and software implementation of the recognition module are described. Computer experiments show that, after 150 training epochs on a fairly limited training sample, the modified GoogLeNet achieves a speaker-emotion recognition accuracy of approximately 0.97 on test examples. This accuracy is comparable to that of the best modern systems of this kind and confirms the promise of GoogLeNet-like neural network models for emotion recognition based on voice-signal analysis. The need for further research aimed at reducing the resource intensity of the neural network model is also noted.
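The abstract does not specify how the voice signal is rendered as a square grayscale image; the following is only a minimal illustrative sketch of one common approach (a log-mel spectrogram quantized to 8-bit grayscale and resized to a CNN-friendly square), not the authors' actual procedure. The file name, fragment duration, and image size are assumptions for illustration.

```python
# Illustrative sketch only: convert a fixed voice-signal fragment into a
# square grayscale image for a GoogLeNet-style CNN. The actual procedure
# used in the article may differ.
import numpy as np
import librosa
from PIL import Image

def voice_fragment_to_square_image(path, size=224, sr=16000, duration=2.0):
    """Load a fixed-duration fragment of a recording and return a
    size x size 8-bit grayscale PIL image."""
    # Load a fixed fragment of the voice signal.
    y, sr = librosa.load(path, sr=sr, duration=duration)
    # Assumed time-frequency representation: log-mel spectrogram.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    # Normalize to [0, 255] and quantize to 8-bit grayscale.
    span = log_mel.max() - log_mel.min()
    norm = (log_mel - log_mel.min()) / (span + 1e-9)
    img = Image.fromarray((norm * 255).astype(np.uint8), mode="L")
    # Resize to a square matching the network's expected input resolution.
    return img.resize((size, size))

# Example (hypothetical file): voice_fragment_to_square_image("sample.wav").save("sample.png")
```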