This paper presents an unmanned aerial vehicle (UAV) identification and tracking system that monitors UAVs using weakly supervised semantic segmentation. A camera mounted on a pan–tilt unit collects images for the semantic segmentation network in real time. The GrabCut+ algorithm and annotation boxes are used to generate UAV “pseudo pixel labels” for supervised model learning, which reduces labelling costs. A new loss function combining the focal loss and the Dice loss is designed to balance positive and negative samples and improve segmentation quality. The Mixup method is introduced during model training to prevent overfitting and enhance the generalization ability of the model. The predictions of the semantic segmentation network are post-processed by a fully connected conditional random field to smooth the segmented target. Furthermore, a region-based tracking method is proposed to solve the hysteresis problem of the pan–tilt control system and improve tracking performance. Finally, an experiment on a dataset shows that the segmentation algorithm achieves 66.3% mIoU. With the central 10% of the field of view defined as the view centre, the UAV falls within this centre in more than 80% of cases, demonstrating the real-time effectiveness of the designed UAV identification and tracking system.
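
As an illustration of the combined loss described above, the following is a minimal NumPy sketch of a weighted sum of a focal-type loss and a Dice loss for binary (UAV versus background) segmentation. The abstract does not give the exact formulation, so the function names, the mixing weight `weight`, and the hyperparameters `gamma`, `alpha` and `smooth` are assumptions chosen for illustration, not the authors' settings.

```python
import numpy as np


def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss averaged over all pixels (gamma, alpha are assumed values)."""
    probs = np.clip(probs, eps, 1.0 - eps)
    # Probability assigned to the true class of each pixel.
    p_t = np.where(targets == 1, probs, 1.0 - probs)
    # Class-balancing factor: alpha for foreground, 1 - alpha for background.
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    # The (1 - p_t)^gamma factor down-weights pixels that are already well classified.
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss; smooth avoids division by zero on empty masks."""
    intersection = np.sum(probs * targets)
    denominator = np.sum(probs) + np.sum(targets)
    return float(1.0 - (2.0 * intersection + smooth) / (denominator + smooth))


def combined_loss(probs, targets, weight=0.5):
    """Weighted sum of focal and Dice losses (the 0.5 weighting is an assumption)."""
    return weight * focal_loss(probs, targets) + (1.0 - weight) * dice_loss(probs, targets)


# Toy 2x3 mask with a single foreground (UAV) pixel.
targets = np.array([[0, 0, 1], [0, 0, 0]])
good_prediction = np.array([[0.1, 0.1, 0.9], [0.1, 0.1, 0.1]])
poor_prediction = np.array([[0.6, 0.6, 0.2], [0.6, 0.6, 0.6]])
print(combined_loss(good_prediction, targets) < combined_loss(poor_prediction, targets))
```

The focal term reduces the influence of the many easy background pixels, while the Dice term scores overlap with the small foreground region directly, which is one common motivation for pairing the two when the target occupies only a few pixels.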