Abstract

With the rapid development of information technology in modern society, the application of multimedia integration platform is more and more extensive. Speech recognition has become an important subject in the process of multimedia visual interaction. The accuracy of speech recognition is dependent on a number of elements, two of which are the acoustic characteristics of speech and the speech recognition model. Speech data is complex and changeable. Most methods only extract a single type of feature of the signal to represent the speech signal. This single feature cannot express the hidden information. And, the excellent speech recognition model can also better learn the characteristic speech information to improve performance. This work proposes a new method for speech recognition in multimedia visual interaction. First of all, this work considers the problem that a single feature cannot fully represent complex speech information. This paper proposes three kinds of feature fusion structures to extract speech information from different angles. This extracts three different fusion features based on the low-level features and higher-level sparse representation. Secondly, this work relies on the strong learning ability of neural network and the weight distribution mechanism of attention model. In this paper, the fusion feature is combined with the bidirectional long and short memory network with attention. The extracted fusion features contain more speech information with strong discrimination. When the weight increases, it can further improve the influence of features on the predicted value and improve the performance. Finally, this paper has carried out systematic experiments on the proposed method, and the results verify the feasibility.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call