Owing to the ubiquity of sensor-equipped cameras such as smartphones, images are increasingly captured with spatial metadata, including the camera's geographical location and viewing orientation. This metadata can be used to automatically generate richer semantic keywords for geo-tagged urban street images, complementing the visual keywords extracted through image analysis. This study introduces a novel framework for auto-tagging images that integrates both spatial and visual properties to generate comprehensive and accurate tags. The framework operates through four phases: extraction, abstraction, composition, and assessment. Our research highlights the benefits of combining visual and spatial analyses, demonstrated through a case study using geo-tagged urban street images from Orlando, Pittsburgh, and Manhattan. Experimental results show that the proposed framework significantly improves the accuracy of keyword-based image search compared to conventional methods. In particular, in our experiments, image search using the tags generated by our framework, referred to as descriptive tags, achieved an average precision improvement factor of 0.9 over conventional tags. Additionally, our proposed ranking algorithm, which extends the term frequency-inverse document frequency (TF-IDF) scheme, yielded improvement factors of 0.86 in mean average precision (MAP) and 0.57 in mean reciprocal rank (MRR). Moreover, the framework's flexibility and robustness make it suitable for diverse applications, from smart cities to online shopping. The paper also includes a detailed evaluation and user study confirming the precision and reliability of the generated tags.
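For context, the ranking algorithm mentioned above extends the standard TF-IDF weighting; as a reference point only (using conventional notation rather than symbols defined in this paper), the baseline score of a tag $t$ for an image's tag document $d$ in a collection of $N$ documents is

\[
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)},
\]

where $\mathrm{tf}(t, d)$ is the frequency of $t$ in $d$ and $\mathrm{df}(t)$ is the number of documents containing $t$. The specific extension of this baseline is described later in the paper.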