Timely understanding of affected areas during disasters is essential for implementing emergency response activities. As a low-cost and information-rich form of volunteered geographic information, social media data can reflect geographic events through human behavior and are a powerful supplementary source for fine-grained flood monitoring in urban areas. However, the value of social media data has not been fully exploited, because location and water depth information may be embedded in both text and images. In this study, we propose a novel framework for fine-grained information extraction and dynamic spatio-temporal awareness in disaster-stricken areas based on Sina Weibo. First, we construct a novel fine-grained location corpus specifically for urban flooding contexts. Based on a named entity recognition (NER) model, the corpus summarizes the characteristics of address descriptions in flood-related Weibo texts, including standard address entities and spatial relationship entities. Then, water depth information in texts and images is obtained with separate deep learning modules and fused at the decision level. Specifically, in the text analysis module, we summarize and extract diverse descriptions of water depth, and in the image analysis module, we develop a water level hierarchical mapping method. Finally, we analyze the spatio-temporal distribution characteristics and variation patterns of the extracted information to enhance situational awareness. Taking an urban flood that occurred in Anhui, China, as a case study, we find that the variations of flooding hotspot areas derived from Sina Weibo and of rainfall centers show significant spatial and temporal consistency, and that fusing text-based and image-based information can facilitate dynamic perception of flood processes. The framework presented in this study provides a feasible way to achieve timely, refined situational awareness and spatio-temporal evolution analysis of urban floods at the city level.