Abstract

Video captioning describes the objects in a video, their attributes, and the relationships among them in natural language. It remains a challenging research topic in computer science and multimedia. In this paper, deep learning methods are used to extract video frame features, motion information, and video sequence features. Two multimodal feature fusion methods are studied, feature cascade and weighted average model fusion, and the evaluation of the generated captions is also studied. The experimental results show that weighted average model fusion scores higher than feature cascade on every evaluation metric. The feature extraction and multimodal fusion methods presented in this paper have practical value for video captioning applications.
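To make the difference between the two fusion strategies concrete, the following is a minimal sketch, not the implementation used in this paper. It assumes that feature cascade means joining the per-modality feature vectors end to end before they reach a single caption model, and that weighted average fusion means averaging the next-word probability distributions produced by separately trained models. The function names, the weights, and the toy numbers are all hypothetical.

```python
from typing import List, Sequence


def cascade_features(features: Sequence[Sequence[float]]) -> List[float]:
    """Feature cascade (assumed: early fusion by concatenation).

    Joins the per-modality vectors end to end into one input vector.
    """
    fused: List[float] = []
    for vector in features:
        fused.extend(vector)
    return fused


def weighted_average_fusion(
    distributions: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Weighted average model fusion (assumed: late fusion of outputs).

    Each entry of `distributions` is one model's probability distribution
    over the same vocabulary; `weights` holds one hypothetical weight per model.
    """
    if not distributions or len(distributions) != len(weights):
        raise ValueError("need exactly one weight per model distribution")
    vocab_size = len(distributions[0])
    if any(len(d) != vocab_size for d in distributions):
        raise ValueError("all distributions must cover the same vocabulary")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalise by the weight sum so the result is still a distribution.
    return [
        sum(w * d[i] for d, w in zip(distributions, weights)) / total
        for i in range(vocab_size)
    ]


# Toy values standing in for frame, motion, and sequence features.
frame_feat = [0.2, 0.7]
motion_feat = [0.5]
sequence_feat = [0.1, 0.9]
cascaded = cascade_features([frame_feat, motion_feat, sequence_feat])

# Toy next-word distributions from two hypothetical single-modality models.
fused_probs = weighted_average_fusion([[0.6, 0.4], [0.2, 0.8]], [1.0, 1.0])
```

In this sketch the cascade produces one longer input vector for a single model, whereas the weighted average leaves each model untouched and combines only their outputs, so the per-model weights can be tuned without retraining.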
