Abstract. Automatic extraction of building footprints from aerial and satellite imagery has become increasingly important in urban planning, disaster management, and environmental monitoring. However, achieving accurate building footprint extraction remains challenging due to the diversity of building characteristics and their similarity to surrounding background elements. While conventional building footprint extraction methods have relied mainly on image processing techniques, recent advances in deep learning, particularly semantic segmentation architectures such as U-Net, have shown promise in addressing these challenges. This study explores U-Net models of different depths for building footprint extraction, aiming to identify the optimal architecture while investigating the semantic uncertainty of the extracted footprints. Using aerial imagery of cities including Berlin, Paris, Chicago, and Zurich, collected from Google Maps together with OpenStreetMap (OSM) labels, five U-Net models of varying depth were compared. In addition, the impact of training dataset size and learning rate on model performance was investigated. Results confirm that the U-Net-32-1024 model achieves the highest intersection over union (IoU), accuracy, and F1-score. Moreover, increasing the training dataset size leads to significant performance improvements, with IoU, accuracy, and F1-score reaching 73.73%, 88.65%, and 88.53%, respectively. However, challenges remain in accurately delineating buildings in dense urban areas. Nonetheless, our findings demonstrate the effectiveness of U-Net models for building footprint extraction.
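For reference, the sketch below illustrates how the reported evaluation metrics (IoU, pixel accuracy, and F1-score) can be computed from binary building masks. This is a minimal NumPy sketch, not code from the study itself; the function name and interface are assumptions introduced here for illustration.

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, truth: np.ndarray) -> dict:
    """Compute IoU, pixel accuracy, and F1-score for binary building masks.

    Both inputs are 0/1 (or boolean) arrays of the same shape, where 1
    marks building pixels. Illustrative only; not the paper's code.
    """
    pred = pred.astype(bool)
    truth = truth.astype(bool)

    tp = np.logical_and(pred, truth).sum()    # building correctly predicted
    fp = np.logical_and(pred, ~truth).sum()   # background labeled as building
    fn = np.logical_and(~pred, truth).sum()   # building pixels missed
    tn = np.logical_and(~pred, ~truth).sum()  # background correctly rejected

    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"iou": iou, "accuracy": accuracy, "f1": f1}
```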