As industrial production becomes increasingly intelligent, sound monitoring technology has been widely used to track the operating status of mechanical equipment, and it has gradually become a research hotspot in the steel manufacturing industry. However, the complex composition of sound sources and the high computational requirements of most models limit their applicability to industrial scenarios. This work presents a targeted approach for the remote monitoring of rolling sound. It uses visual (time-frequency) features of audio signals to design classification models on a self-collected dataset, enabling more efficient adaptation to complex production sites. An in-depth analysis of the actual sound reveals that it is characterized by high similarity, complexity, and partial synchronization. To address the small sample size and optimize interclass and intraclass data engineering, multi-feature fusion and data augmentation are combined to fully characterize signal details. In addition, the deep stacks of small convolutions from VGGNet and the randomness of stochastic pooling are leveraged to effectively extract local features. Finally, global average pooling followed by a softmax layer is used to classify the rolling signals, reducing the number of parameters, mitigating overfitting, and enabling a global analysis of the features. Experimental results on the rolling sound dataset show that the proposed method achieves an accuracy of 91.26%, and accuracies of 92.66% and 95.88% on the ESC10 and MIMII datasets, respectively. These results confirm that the method can be widely applied to sound classification for multi-category rolling processes, showing good performance and scalability.
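To make the described pipeline concrete, the sketch below outlines one possible PyTorch realization of a VGG-style stack of small 3x3 convolutions with stochastic pooling, followed by global average pooling and a softmax classifier. The channel widths, input size, and the specific stochastic-pooling implementation are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticPool2d(nn.Module):
    """2x2 stochastic pooling: during training, sample one activation per
    pooling region with probability proportional to its value; at inference,
    use the probability-weighted average of the region."""
    def __init__(self, kernel_size=2, stride=2):
        super().__init__()
        self.k, self.s = kernel_size, stride

    def forward(self, x):
        n, c, h, w = x.shape
        # Extract pooling regions as patches: (N, C, k*k, L)
        patches = F.unfold(x, self.k, stride=self.s).view(n, c, self.k * self.k, -1)
        # Probabilities proportional to non-negative activations
        probs = patches.clamp(min=0) + 1e-12
        probs = probs / probs.sum(dim=2, keepdim=True)
        if self.training:
            # Sample one activation index per region
            idx = torch.multinomial(
                probs.permute(0, 1, 3, 2).reshape(-1, self.k * self.k), 1)
            idx = idx.view(n, c, -1, 1).permute(0, 1, 3, 2)
            out = patches.gather(2, idx).squeeze(2)
        else:
            # Probability-weighted average at test time
            out = (patches * probs).sum(dim=2)
        out_h = (h - self.k) // self.s + 1
        out_w = (w - self.k) // self.s + 1
        return out.view(n, c, out_h, out_w)

class RollingSoundNet(nn.Module):
    """Minimal sketch of the described architecture, assuming single-channel
    log-mel spectrogram inputs (e.g. 1 x 64 x 64) and hypothetical widths."""
    def __init__(self, n_classes):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True),
                StochasticPool2d(2, 2))
        self.features = nn.Sequential(block(1, 32), block(32, 64), block(64, 128))
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),     # global average pooling
            nn.Flatten(),
            nn.Linear(128, n_classes))   # softmax applied via the loss function

    def forward(self, x):
        return self.classifier(self.features(x))

# Example usage with a batch of 8 hypothetical spectrograms and 5 classes:
logits = RollingSoundNet(n_classes=5)(torch.randn(8, 1, 64, 64))
```

In this sketch, global average pooling replaces fully connected layers before the classifier, which keeps the parameter count small and reduces overfitting, consistent with the design goals stated above.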