Social sensing, which uses humans as sensors to collect disaster data, has emerged as a timely, cost-effective, and reliable source of information. However, existing research has focused mainly on textual data. With advances in information technology, multimodal data such as images and videos are now widely shared on social media platforms, enabling more in-depth analysis in social sensing systems. This study proposes an analytical framework for extracting disaster-related spatiotemporal information from multimodal social media data. Using a pre-trained multimodal neural network and a location entity recognition model, the framework integrates disaster semantics with spatiotemporal information, enhancing situational awareness. A case study of the April 2024 heavy rain event in Guangdong, China, based on Weibo data, demonstrates that multimodal content correlates more strongly with rainfall patterns than textual data alone, offering a dynamic perception of the disaster. These findings confirm the utility of multimodal social media data and provide a foundation for future research. The proposed framework has valuable applications in emergency response, disaster relief, risk assessment, and witness discovery, and presents a viable approach for safety risk monitoring and early warning systems.
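
To make the described pipeline concrete, the following is a minimal sketch of how such a framework could be assembled from off-the-shelf components: a CLIP-style pre-trained multimodal model scores attached images for disaster relevance, and a named-entity-recognition pipeline extracts place names for geocoding. The specific models, the prompt wording, and the `disaster_relevance` and `extract_locations` helpers are illustrative assumptions, not the paper's actual implementation; processing Chinese Weibo text would additionally require a Chinese NER model in place of the English default used here.

```python
# Illustrative sketch only; the paper's exact models and logic are not reproduced here.
# Assumes the Hugging Face `transformers` library and PIL are available.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, pipeline

# 1) Pre-trained multimodal model: zero-shot relevance scoring of an attached image
#    against a disaster-related prompt versus an unrelated one.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def disaster_relevance(image_path: str,
                       prompts=("a photo of urban flooding after heavy rain",
                                "an ordinary street scene")) -> float:
    image = Image.open(image_path)
    inputs = processor(text=list(prompts), images=image,
                       return_tensors="pt", padding=True)
    probs = clip(**inputs).logits_per_image.softmax(dim=-1)
    return probs[0, 0].item()  # probability assigned to the disaster prompt

# 2) Location entity recognition: extract place names from the post text so each
#    post can be geocoded and aggregated on a spatiotemporal grid.
ner = pipeline("ner", aggregation_strategy="simple")

def extract_locations(text: str) -> list[str]:
    return [e["word"] for e in ner(text) if e["entity_group"] == "LOC"]

# Example usage (hypothetical post):
# score = disaster_relevance("weibo_image.jpg")
# places = extract_locations("Severe waterlogging reported in Guangzhou, Guangdong.")
```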