Social media platforms are turning into important news sources since they provide real-time information from different perspectives. However, high volume, dynamism, noise, and redundancy exhibited by social media data make it difficult to comprehend the entire content. Recent works emphasize on summarizing the content of either a single social media platform or of a single modality (either textual or visual). However, each platform has its own unique characteristics and user base, which brings to light different aspects of real-world events. This makes it critical as well as challenging to combine textual and visual data from different platforms. In this article, we propose summarization of real-world events with data stemming from different platforms and multiple modalities. We present the use of a Markov Random Fields based similarity measure to link content across multiple platforms. This measure also enables the linking of content across time, which is useful for tracking the evolution of long-running events. For the final content selection, summarization is modeled as a subset selection problem. To handle the complexity of the optimal subset selection, we propose the use of submodular objectives. Facets such as coverage, novelty, and significance are modeled as submodular objectives in a multimodal social media setting. We conduct a series of quantitative and qualitative experiments to illustrate the effectiveness of our approach compared to alternative methods.