With the advent of the information age, language is no longer the only way to construct meaning. Beyond language, a growing variety of social symbols, such as gestures, images, music, and three-dimensional animation, now take part in the social practice of meaning construction. Traditional single-modality sentiment analysis methods handle only one form of expression and cannot fully exploit information from multiple modalities, which limits sentiment classification accuracy. Deep learning techniques, by contrast, can automatically mine emotional states from images, text, and video and can effectively combine information across modalities; a minimal sketch of such multimodal fusion is given at the end of this section.

The book Reading Images proposed the first systematic and comprehensive framework for visual grammatical analysis, discussing how images express meaning from the perspectives of representational, interactive, and compositional meaning, in parallel with the three metafunctions (ideational, interpersonal, and textual) of Halliday's systemic functional grammar. In the past, films were usually discussed from the macro perspectives of literary criticism, film criticism, psychology, and aesthetics; multimodal analysis theory instead provides film researchers with a method for analyzing images, music, and words at the same time. In view of these considerations, this paper adopts the perspective of social semiotics and, drawing on Halliday's systemic functional linguistics, Gan He's "visual grammar," and appraisal theory, builds a multimodal interaction model as a tool for analyzing film discourse.
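The fusion capability mentioned above is commonly realized by encoding each modality separately and combining the resulting feature vectors before classification. The sketch below illustrates one such scheme (late fusion by concatenation) in PyTorch; it is not taken from this paper, and the feature dimensions (768 for text, 2048 for images), hidden size, and three sentiment classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LateFusionSentimentClassifier(nn.Module):
    """Toy late-fusion classifier: each modality is projected into a shared
    space, the projections are concatenated, and a small MLP predicts the
    sentiment class. Dimensions below are assumptions for illustration."""

    def __init__(self, text_dim=768, image_dim=2048, hidden_dim=256, num_classes=3):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)    # text branch
        self.image_proj = nn.Linear(image_dim, hidden_dim)  # image branch
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, text_feat, image_feat):
        # text_feat: (batch, text_dim) embeddings from some text encoder
        # image_feat: (batch, image_dim) pooled features from some image encoder
        fused = torch.cat([self.text_proj(text_feat),
                           self.image_proj(image_feat)], dim=-1)
        return self.classifier(fused)

# Random tensors stand in for real encoder outputs in this illustration.
model = LateFusionSentimentClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 2048))
print(logits.shape)  # torch.Size([4, 3]) -- one score per sentiment class
```

Concatenation is only the simplest fusion choice; attention-based or tensor-based fusion would replace the torch.cat step while keeping the same overall encode-then-fuse structure.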