Social media has recently become an important topic in the field of public opinion analysis. Among social media platforms, microblogs are among the most important because their posts are short, convenient to publish from mobile devices, and instantaneous. Microblog recognition on social media reflects the attitudes, positive or negative, of a very large population toward a specific incident, and can be used for deriving competitive intelligence, shaping marketing strategies, detecting depression, and so on. However, existing methods usually use only the text or only the images from the internet and do not take advantage of their complementary information when making the final recognition decision, which limits the performance and robustness of the algorithms. In this paper, we present a collaborative decision network (CDN) based on cross-modal attention that exploits the discriminative attributes of multiple modalities through a joint data- and knowledge-driven strategy, and thereby further improves recognition performance. In addition, we collect and construct a visual-text microblog recognition dataset with 2854 samples to support subsequent research in related fields. Finally, experimental results on the collected dataset show the effectiveness and superiority of the proposed CDN.