The Czochralski method is the primary technique for producing single-crystal silicon. However, anomalous states such as crystal loss, twisting, swinging, and squareness frequently occur during crystal growth, degrading product quality and production efficiency. To address this problem, we propose an enhanced multimodal fusion classification model that detects and categorizes these four anomalous states. The model first transforms one-dimensional signals (diameter, temperature, and pulling speed) into time–frequency images via the continuous wavelet transform, and a Dense-ECA-SwinTransformer network extracts features from these images. In parallel, meniscus images and inter-frame difference images are obtained from the growth system’s meniscus video feed, fused at the channel level, and passed through a ConvNeXt network for feature extraction. Finally, the time–frequency features are combined with the meniscus image features and fed into fully connected layers for multi-class classification. Experimental results show that the method effectively detects these anomalous states, helping operators judge the growth state more accurately and formulate a targeted response to each anomaly, which improves production efficiency, saves production resources, and protects the crystal-pulling equipment.
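
The two input-preparation steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: a Morlet mother wavelet with `w0 = 6`, per-signal min-max normalization, stacking the three scalograms as image channels, and interpreting channel-level fusion as concatenation of the current frame with its absolute inter-frame difference. The function names are hypothetical, and the Dense-ECA-SwinTransformer and ConvNeXt feature extractors are omitted.

```python
import numpy as np


def morlet_cwt(signal, scales, w0=6.0):
    """Magnitude of a Morlet CWT, shape (len(scales), len(signal)).

    Assumption: the wavelet family and w0 are illustrative choices.
    """
    signal = np.asarray(signal, dtype=float)
    n = signal.size
    out = np.empty((len(scales), n))
    for i, s in enumerate(scales):
        # Sample the wavelet on +/- 4 scales, capped so it is never
        # longer than the signal (keeps mode="same" output at length n).
        half = min(int(4 * s), (n - 1) // 2)
        t = np.arange(-half, half + 1) / s
        wavelet = np.exp(1j * w0 * t) * np.exp(-0.5 * t**2) / np.sqrt(s)
        # CWT is a correlation with the wavelet, i.e. a convolution
        # with its time-reversed complex conjugate.
        kernel = np.conj(wavelet)[::-1]
        out[i] = np.abs(np.convolve(signal, kernel, mode="same"))
    return out


def signals_to_image(signals, scales):
    """Stack one normalized scalogram per signal along the last axis.

    Assumption: one channel per signal (diameter, temperature, pulling speed).
    """
    channels = []
    for x in signals:
        s = morlet_cwt(x, scales)
        span = s.max() - s.min()
        channels.append((s - s.min()) / span if span > 0 else np.zeros_like(s))
    return np.stack(channels, axis=-1)


def fuse_meniscus_frames(prev_frame, frame):
    """Concatenate a frame with its absolute inter-frame difference.

    Assumption: "channel-level fusion" is plain channel concatenation,
    turning two (H, W, C) arrays into one (H, W, 2C) array.
    """
    prev_frame = np.asarray(prev_frame, dtype=np.float32)
    frame = np.asarray(frame, dtype=np.float32)
    diff = np.abs(frame - prev_frame)
    return np.concatenate([frame, diff], axis=-1)
```

With three signals of length T and S scales, `signals_to_image` yields an array of shape (S, T, 3) that can be resized to the input size of an image network, while `fuse_meniscus_frames` doubles the channel count of each meniscus frame, so the first layer of the image branch would need to accept 2C input channels.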