Real-time quality monitoring from molten pool images is a key research focus for high-quality, intelligent automated welding. However, the dynamic nature of the molten pool, changes in camera perspective, and variations in pool shape make defects difficult to detect from single-frame images. To address these issues, we propose a multi-scale fusion method for defect monitoring based on molten pool videos. The method analyzes temporal changes in the light spots on the molten pool surface and transfers features between frames to capture dynamic behavior. It performs multi-scale feature fusion using row and column convolutions together with a gated fusion module, which accommodates variations in pool size and position and enables light spot changes of different sizes and directions to be detected from coarse to fine. In addition, mixed attention over the row and column features allows the model to capture the characteristics of the molten pool more efficiently. Our method achieves an accuracy of 97.416% on a molten pool video dataset, with a processing time of 16 ms per sample. Experimental results on the UCF101-24 and JHMDB datasets further demonstrate the method's generalization capability.
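
As an illustration of the ideas named above (inter-frame change, row and column convolutions at several scales, and gated fusion), the following is a minimal pure-Python sketch and not the authors' implementation. All function names are hypothetical. The box-filter kernels stand in for learned convolution weights, and the gate is a fixed sigmoid of the difference between the two branches, whereas a trained model would learn it.

```python
import math
from typing import List, Sequence

Grid = List[List[float]]


def frame_difference(prev: Grid, curr: Grid) -> Grid:
    """Element-wise change between two consecutive frames."""
    return [[c - p for p, c in zip(pr, cr)] for pr, cr in zip(prev, curr)]


def row_conv(x: Grid, kernel: Sequence[float]) -> Grid:
    """1 x k convolution along each row, zero padded, same output size."""
    pad = len(kernel) // 2
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = 0.0
            for t, kv in enumerate(kernel):
                jj = j + t - pad
                if 0 <= jj < w:
                    s += kv * x[i][jj]
            out[i][j] = s
    return out


def transpose(x: Grid) -> Grid:
    return [list(col) for col in zip(*x)]


def col_conv(x: Grid, kernel: Sequence[float]) -> Grid:
    """k x 1 convolution along each column, via transposition."""
    return transpose(row_conv(transpose(x), kernel))


def gated_fusion(a: Grid, b: Grid) -> Grid:
    """Blend two feature maps as g*a + (1-g)*b.

    Assumption: g = sigmoid(a - b) is a fixed, hand-picked gate used
    only for illustration; it is not taken from the paper.
    """
    fused = []
    for ra, rb in zip(a, b):
        row = []
        for va, vb in zip(ra, rb):
            g = 1.0 / (1.0 + math.exp(-(va - vb)))
            row.append(g * va + (1.0 - g) * vb)
        fused.append(row)
    return fused


def multi_scale_features(prev: Grid, curr: Grid,
                         sizes: Sequence[int] = (3, 5)) -> List[Grid]:
    """One fused row/column feature map per kernel size."""
    diff = frame_difference(prev, curr)
    maps = []
    for k in sizes:
        kernel = [1.0 / k] * k  # box filter, stand-in for learned weights
        maps.append(gated_fusion(row_conv(diff, kernel),
                                 col_conv(diff, kernel)))
    return maps
```

In this sketch the row branch responds most strongly to horizontally extended changes and the column branch to vertically extended ones, and the larger kernel gives the coarser view. Calling `multi_scale_features(prev, curr)` on two equally sized frames returns one fused map per entry in `sizes`.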