Abstract. Rapid urbanization creates new challenges and requirements for multi-class object detection in urban scenes. Accurately identifying buildings, vehicles, and trees can support urban planning, traffic management, and environmental monitoring, and promote the development of smart cities. Traditional object detection methods perform poorly in complex urban environments, whereas deep learning methods achieve accurate recognition and localization by automatically extracting high-level semantic features. In this study, we use the YOLOv5s algorithm for multi-class object detection in urban scenes. YOLOv5s is a lightweight deep learning model with a small storage footprint and fast detection speed. We use the Potsdam dataset published by ISPRS to create labeled data for buildings, vehicles, and trees, and train the YOLOv5s model iteratively on these labels. The trained model reaches an mAP of 82.83%. Compared with SSD and Faster R-CNN, YOLOv5s achieves higher accuracy in tree detection and slightly lower accuracy in building and vehicle detection. Taking detection accuracy, speed, and model size together, YOLOv5s offers better overall detection performance for multi-class objects in urban scenes.
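For readers unfamiliar with the mAP metric reported above, the following is a minimal sketch of how a mean average precision over the three classes could be computed. It assumes that each detection has already been matched against the ground truth and flagged as a true or false positive; the IoU threshold, the all-point interpolation used here, and the function names `average_precision` and `mean_average_precision` are illustrative assumptions, not the evaluation protocol of this paper.

```python
def average_precision(detections, num_ground_truth):
    """Area under the precision-recall curve for one class.

    detections: list of (confidence, is_true_positive) pairs. The matching
    of detections to ground-truth boxes is assumed to be done beforehand.
    num_ground_truth: number of ground-truth objects of this class.
    """
    if num_ground_truth == 0:
        return 0.0

    # Rank detections from most to least confident.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)

    true_pos = 0
    false_pos = 0
    recalls = []
    precisions = []
    for _, is_true_positive in ranked:
        if is_true_positive:
            true_pos += 1
        else:
            false_pos += 1
        recalls.append(true_pos / num_ground_truth)
        precisions.append(true_pos / (true_pos + false_pos))

    # Make precision monotonically non-increasing (precision envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas wherever recall increases.
    ap = 0.0
    previous_recall = 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(per_class):
    """per_class: dict mapping class name -> (detections, num_ground_truth).

    Returns (mAP, dict of per-class AP values).
    """
    ap_by_class = {
        name: average_precision(detections, num_gt)
        for name, (detections, num_gt) in per_class.items()
    }
    return sum(ap_by_class.values()) / len(ap_by_class), ap_by_class
```

Under these assumptions, calling `mean_average_precision` with one entry each for buildings, vehicles, and trees returns the unweighted mean of the three per-class AP values, so a class with lower AP pulls the overall score down regardless of how many objects it contains.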