Microblogging services have changed the way people exchange information. When a popular event or emergency occurs, a large volume of data is generated on the web, including textual descriptions of the event's time, location, and details. Meanwhile, users can conveniently review, comment on, and spread news of the event. How to use this mass of data to detect and predict breaking events has long been an active research topic. While existing approaches mostly focus on event detection, event location estimation, and text-based summarization, few works have addressed multimedia event summarization. In this paper, we put forward a new social-media-based event summarization framework comprising three stages: (1) a coarse-to-fine filtering model is exploited to eliminate irrelevant information; (2) a novel User–Text–Image Co-clustering (UTICC) method is proposed to jointly discover subevents from microblogs of multiple media types (user, text, and image); and (3) a multimedia event summarization process is designed to identify representative texts and images, which are then aggregated into a holistic visualized summary of the event. We conduct extensive experiments on a Weibo dataset to demonstrate the superiority of the proposed framework over state-of-the-art approaches.