Abstract

In recent years, the use of multimodal human-computer interaction technology to augment human intelligence has become a new topic in human-computer interaction research. When a robot cannot respond correctly through a single modality, multimodal fusion is required. To this end, this paper proposes a multimodal fusion algorithm that carries data from the CNN feature layer through to decision-level fusion. The speech recognition text is semantically matched against a text library, which returns a similarity probability vector; a similarity probability vector is likewise obtained from gesture recognition. Both vectors are filtered by a threshold, yielding a set of high-probability instruction codes for each modality. The intersection of the two sets is then computed, and the resulting instruction is sent to the robot. Experimental results show that the influence of environmental factors on single-channel results is reduced and the ambiguity of single-modality input is eliminated. The weighted multi-channel fusion algorithm is more accurate than the common (unweighted) multi-channel fusion algorithm, and it was also well received by many test users.
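As a rough illustration of the decision-level step described above, the sketch below thresholds each modality's similarity probability vector, intersects the surviving instruction codes, and ranks the survivors by a weighted sum. The threshold value, the weights, and all names (fuse_decisions, speech_probs, gesture_probs) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fuse_decisions(speech_probs, gesture_probs,
                   speech_weight=0.6, gesture_weight=0.4,
                   threshold=0.5):
    """Decision-level fusion sketch (assumed details, not the paper's code).

    speech_probs, gesture_probs: 1-D arrays of similarity probabilities,
    one entry per instruction code in a shared instruction library.
    Returns the fused instruction code, or None if the modalities disagree.
    """
    speech_probs = np.asarray(speech_probs, dtype=float)
    gesture_probs = np.asarray(gesture_probs, dtype=float)

    # Keep only the instruction codes whose probability clears the threshold.
    speech_codes = set(np.flatnonzero(speech_probs >= threshold))
    gesture_codes = set(np.flatnonzero(gesture_probs >= threshold))

    # Intersect the two high-probability code sets.
    candidates = speech_codes & gesture_codes
    if not candidates:
        return None  # no agreement between the two modalities

    # Rank the surviving codes by a weighted sum of the two modalities.
    fused_score = {c: speech_weight * speech_probs[c]
                      + gesture_weight * gesture_probs[c]
                   for c in candidates}
    return max(fused_score, key=fused_score.get)

# Example: code 2 clears both thresholds and wins.
print(fuse_decisions([0.1, 0.3, 0.8, 0.6], [0.7, 0.2, 0.9, 0.4]))
```

In this sketch the intersection guarantees that an instruction is issued only when both modalities agree, while the weighting lets one channel (here, speech) contribute more to breaking ties among the agreed codes.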
