Betula luminifera, a hardwood tree indigenous to South China, has significant economic and ecological value. Given the current severity of drought conditions, there is an urgent need to enhance this species' drought tolerance. However, traditional manual methods are too inefficient to meet the demands of breeding efforts. To monitor drought stress in a high-throughput, automated manner, we proposed a deep learning model based on phenotypic characteristics to identify and classify drought stress in B. luminifera seedlings. First, visible-light images were obtained from a drought stress experiment conducted on B. luminifera shoots. Considering the characteristics of these images, we proposed an SAM-CNN architecture that incorporates spatial attention modules (SAMs) into classical CNN models. Of the four classical CNNs compared, ResNet50 performed best and was therefore selected for the construction of the SAM-CNN. We then analyzed the classification performance of the resulting SAM-ResNet50 model in terms of transfer learning, training from scratch, model robustness, and visualization. SAM-ResNet50 achieved an accuracy of 99.6%, which was 1.48% higher than that of ResNet50. In robustness testing on spatially transformed images, generated by translating and rotating the test set images, accuracy reached 82.31%, an improvement of 18.98%. In conclusion, the SAM-ResNet50 model achieved outstanding performance (99.6% accuracy) and enabled high-throughput, automatic, phenotype-based monitoring, providing a new perspective on drought stress classification and technical support for B. luminifera breeding work.
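To make the idea of a spatial attention module concrete, the following is a minimal, framework-free sketch of one forward pass. The abstract does not specify the module's internal design, so the sketch assumes a common formulation: the feature map is average-pooled and max-pooled across channels, the two pooled maps are passed through a single k x k convolution, and a sigmoid turns the result into a per-pixel weight that rescales every channel. The function name `spatial_attention`, the kernel size, and the random weights are hypothetical and for illustration only; they are not taken from the SAM-ResNet50 model itself.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def spatial_attention(features, kernel, bias=0.0):
    """Apply an assumed spatial attention step to one feature map.

    features: array of shape (C, H, W), e.g. the output of a CNN block.
    kernel:   array of shape (2, k, k) with odd k; in a real model these
              weights would be learned (hypothetical values are used here).
    Returns the reweighted features (C, H, W) and the attention map (H, W).
    """
    # Summarise each spatial location across the channel axis.
    avg_map = features.mean(axis=0)
    max_map = features.max(axis=0)
    pooled = np.stack([avg_map, max_map])  # shape (2, H, W)

    # Zero-pad so the convolution keeps the spatial size unchanged.
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))

    # Plain sliding-window convolution producing one logit per pixel.
    h, w = avg_map.shape
    logits = np.full((h, w), bias, dtype=float)
    for i in range(h):
        for j in range(w):
            logits[i, j] += np.sum(padded[:, i:i + k, j:j + k] * kernel)

    # Sigmoid gives a weight in (0, 1) for every spatial location.
    attention = sigmoid(logits)

    # Broadcasting applies the same spatial weight to every channel.
    return features * attention, attention


# Usage with made-up data: 4 channels on an 8 x 8 grid, 7 x 7 kernel.
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(4, 8, 8))
demo_kernel = rng.normal(size=(2, 7, 7)) * 0.1
demo_out, demo_attention = spatial_attention(demo_features, demo_kernel)
print(demo_out.shape, demo_attention.shape)
```

In a deep learning framework the explicit loops would be replaced by a convolution layer, and the module would typically be placed after one or more residual blocks so that the network can emphasise the image regions most indicative of drought stress. Where exactly the modules sit in SAM-ResNet50 is not stated in the abstract.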