Building usage maps are inputs to many urban planning applications; however, existing methods and available data are limited in their ability to generate instance-level, high-resolution usage maps. In this study, we tackle this problem by utilizing Street View Images (SVIs) and proposing a novel ensemble learning architecture that leverages building facade features and text extracted from hoardings, posters, and other signage on buildings to predict the usage class. A pre-trained object detection model, Grounding DINO, is used to efficiently identify buildings. A new, manually labeled training dataset of detected buildings and their corresponding usage classes is used to extract facade features across diverse Indian cities (Hyderabad, Mumbai, Bangalore, Delhi) with a Vision Transformer (ViT) model. Following this, CLIPSeg, a pre-trained segmentation model, is used to recognize text specifically on building elements such as signs, posters, and banners. We then leverage GPT-3.5 Turbo, a Large Language Model (LLM), fine-tuned with a specifically designed few-shot prompting method, to infer building usage from the recognized text. To achieve optimal performance, the proposed ensemble's linear meta-classifier combines predictions from the ViT and LLM models. The predicted building usages are attributed to their corresponding locations to produce spatial maps. Evaluation of our framework against ground-truth data collected from various Indian cities shows highly accurate results. Our findings highlight the utility of textual information in classifying utility and commercial buildings, while features extracted from the vision model prove more informative for residential buildings. Our approach can automate the generation of roadside building attributes and usage details at scale.
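To make the ensemble step concrete, the following is a minimal, self-contained sketch of a linear meta-classifier that stacks the per-class probabilities of two base models (here standing in for the ViT and LLM predictors). The class layout, toy probability vectors, and training loop are illustrative assumptions only, not the paper's implementation; in practice the inputs would be the actual ViT and fine-tuned GPT-3.5 Turbo outputs per detected building.

```python
import math
import random

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

class LinearMetaClassifier:
    """Linear stacking: concatenate the per-class probabilities of the two
    base models into one feature vector and learn class weights with
    cross-entropy gradient descent. Hypothetical sketch, not the paper's code."""

    def __init__(self, n_classes, n_features, lr=0.5, epochs=300, seed=0):
        rng = random.Random(seed)
        self.n_classes = n_classes
        self.W = [[rng.uniform(-0.01, 0.01) for _ in range(n_features)]
                  for _ in range(n_classes)]
        self.b = [0.0] * n_classes
        self.lr = lr
        self.epochs = epochs

    def _logits(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(self.W, self.b)]

    def fit(self, X, y):
        n = len(X)
        for _ in range(self.epochs):
            for x, t in zip(X, y):
                p = softmax(self._logits(x))
                for k in range(self.n_classes):
                    g = p[k] - (1.0 if k == t else 0.0)  # dCE/dlogit_k
                    for j in range(len(x)):
                        self.W[k][j] -= self.lr * g * x[j] / n
                    self.b[k] -= self.lr * g / n
        return self

    def predict(self, x):
        p = softmax(self._logits(x))
        return max(range(self.n_classes), key=lambda k: p[k])

# Toy example with 3 usage classes (0=residential, 1=commercial, 2=utility).
# Each row is [ViT probs over 3 classes] + [LLM probs over 3 classes].
X = [
    [0.8, 0.1, 0.1, 0.4, 0.3, 0.3],  # vision model confident: residential
    [0.3, 0.4, 0.3, 0.1, 0.8, 0.1],  # text model confident: commercial
    [0.3, 0.3, 0.4, 0.1, 0.1, 0.8],  # text model confident: utility
    [0.7, 0.2, 0.1, 0.5, 0.3, 0.2],
    [0.2, 0.5, 0.3, 0.2, 0.7, 0.1],
    [0.3, 0.2, 0.5, 0.2, 0.2, 0.6],
]
y = [0, 1, 2, 0, 1, 2]

meta = LinearMetaClassifier(n_classes=3, n_features=6).fit(X, y)
preds = [meta.predict(x) for x in X]
```

The design choice mirrors the abstract's observation: when the text-derived probabilities carry the signal (commercial, utility), the learned weights lean on the LLM features, and when the facade features dominate (residential), they lean on the ViT features.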