Abstract Underground coal gangue separation with in-situ filling can reduce environmental pollution, promote resource recycling, and support safe mining operations. However, the harsh underground environment and abnormal working conditions pose a significant challenge to separation technology. It is therefore essential to develop a coal gangue classification method that is accurate, robust, and able to handle abnormal working conditions. To address these problems, this paper combines the spectral and image modalities in a composite multimodal fusion scheme. First, the feasibility of spectral-image fusion and an effective fusion criterion are explored under the concatenation (concat) fusion strategy, using various feature combinations and classification algorithms under ideal conditions, with the aim of improving model performance. Second, feature fusion is introduced into a single-layer perceptron to explore its potential in deep learning and improve model performance. Third, the quantitative criteria of the judgment matrix in the analytic hierarchy process (AHP) are improved to make decision making more scientific and objective. Finally, the effectiveness of the method is verified on a bimodal dataset of simulated working conditions. The results show that the composite fusion of spectral and image features reaches an accuracy of 91.43%, and that the improved AHP can be applied to all basic model scenarios, which makes the method widely applicable and feasible. The fusion experiments with deep neural networks indicate the strong potential of modal fusion in deep learning. The method offers a new approach to the intelligent separation of underground coal gangue.
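To make the AHP-based decision step more concrete, the following is a minimal sketch of how classical AHP can turn a pairwise judgment matrix into weights for combining the outputs of several base classifiers. It is not the improved quantitative criterion proposed in this paper, whose details are not given in the abstract. The function names, the example judgment matrix, and the class probabilities are all hypothetical; the weights are obtained with the standard row geometric mean approximation, and the consistency ratio uses Saaty's random index values.

```python
import math

# Saaty's random consistency index (RI) for matrix orders 3 to 5.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12}


def ahp_weights(judgment):
    """Priority weights from a pairwise judgment matrix.

    Uses the row geometric mean method, a common approximation of the
    principal eigenvector used in classical AHP.
    """
    n = len(judgment)
    geo_means = [math.prod(row) ** (1.0 / n) for row in judgment]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def consistency_ratio(judgment, weights):
    """Consistency ratio CR = CI / RI; CR < 0.1 is usually accepted."""
    n = len(judgment)
    # Estimate lambda_max as the mean of (A w)_i / w_i.
    aw = [sum(judgment[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1)
    return ci / RANDOM_INDEX[n]


def fuse_probabilities(probs_by_model, weights):
    """Weighted sum of per-model class probabilities (decision-level fusion)."""
    n_classes = len(probs_by_model[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, probs_by_model))
        for c in range(n_classes)
    ]


# Hypothetical judgment matrix over three base models:
# spectral-only, image-only, and concat-fused features.
judgment = [
    [1.0, 1 / 2, 1 / 3],
    [2.0, 1.0, 1 / 2],
    [3.0, 2.0, 1.0],
]
weights = ahp_weights(judgment)
cr = consistency_ratio(judgment, weights)

# Hypothetical [P(coal), P(gangue)] outputs of the three models for one sample.
probs_by_model = [[0.60, 0.40], [0.45, 0.55], [0.30, 0.70]]
fused = fuse_probabilities(probs_by_model, weights)
label = ["coal", "gangue"][fused.index(max(fused))]
print(weights, cr, fused, label)
```

In this sketch the judgment matrix is filled in by hand, which is the subjective step that an improved quantitative criterion would replace, for example by deriving the pairwise entries from measured model performance.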