Coal gangue generated during coal mining causes serious environmental pollution. To screen coal gangue efficiently, we developed a detection approach that combines multispectral imaging with an improved Real-Time Detection Transformer (RT-DETR). First, spectral data from 25 bands were reduced to 3 optimized bands, which were combined into a pseudo-RGB image. Then, the backbone network was enhanced with the lightweight Faster-Block and the Efficient Multi-scale Attention (EMA) mechanism, learnable positional encoding was introduced in the Attention-based Intra-scale Feature Interaction (AIFI) layer, and the Re-param Convolutional (RepC3) structure of the Cross-scale Feature Fusion Module (CCFM) layer was upgraded to the Dilated Re-param Block (DRB). Additionally, we used a Stable Diffusion model fine-tuned with Low-Rank Adaptation (LoRA) to generate images, and combined it with data augmentation that simulates specific environments to improve the model’s robustness. The method achieves a detection accuracy, recall, and mean average precision of 92.18%, 86.78%, and 91.95%, respectively. Its lightweight design and high detection accuracy make it an effective solution for coal gangue sorting in mines.
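The pseudo-RGB step described above can be illustrated with a minimal sketch. It assumes the three bands are picked by index from the 25-band cube and that each is min-max scaled to 8 bits; the abstract does not state which bands were kept or how they were normalized, so `band_ids`, `scale_to_uint8`, and `pseudo_rgb` are hypothetical names and values chosen for illustration only.

```python
from typing import List, Sequence, Tuple

Band = List[List[float]]  # one spectral band as rows of reflectance values


def scale_to_uint8(band: Band) -> List[List[int]]:
    """Min-max scale one band to integers in 0..255 (assumed normalization)."""
    flat = [v for row in band for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant band carries no contrast; map it to zeros.
        return [[0 for _ in row] for row in band]
    return [[round(255 * (v - lo) / span) for v in row] for row in band]


def pseudo_rgb(
    cube: Sequence[Band], band_ids: Tuple[int, int, int]
) -> List[List[Tuple[int, int, int]]]:
    """Stack three chosen bands of a multispectral cube into an RGB-like image."""
    r, g, b = (scale_to_uint8(cube[i]) for i in band_ids)
    return [list(zip(r_row, g_row, b_row)) for r_row, g_row, b_row in zip(r, g, b)]


# Toy 25-band cube of 2x2 pixels; band k holds the values k, k+1, k+2, k+3.
cube = [[[float(k), float(k + 1)], [float(k + 2), float(k + 3)]] for k in range(25)]

# Hypothetical band indices; the source does not say which bands were selected.
image = pseudo_rgb(cube, band_ids=(3, 11, 20))
```

Each pixel of `image` is a three-value tuple, so the result can be passed to any detector that expects ordinary three-channel input. Pure Python lists keep the sketch dependency-free; an array library would be the natural choice for real image sizes.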