Vision-based weld seam tracking has become a key technology for intelligent robotic welding, and weld deviation detection is an essential step in it. However, accurate and robust detection of weld deviations during the microwelding of ultrathin metal foils remains a significant challenge, because the fusion zone is at the mesoscopic scale and the images are subject to complex time-varying interference (pulsed arcs and light reflected from the workpiece surface). In this paper, an intelligent seam tracking approach for foil joining is proposed, based on spatial–temporal deep learning from serial images of the molten pool. More specifically, a microscopic passive vision sensor is designed to capture images of the molten pool and seam trajectory under pulsed arc light. A welding torch offset prediction network (WTOP-net), built on a 3D convolutional neural network (3DCNN) and long short-term memory (LSTM), is established to predict deviations with high accuracy by capturing long-term dependencies among spatial–temporal features. Expert knowledge is then incorporated into the spatial–temporal features to improve the robustness of the model. In addition, the slime mould algorithm (SMA) is used to avoid local optima and to improve the accuracy and efficiency of WTOP-net. Experimental results indicate that, when joining two 0.12 mm thick stainless steel diaphragms, the maximum error of our method stays within ±0.08 mm and the average error is within ±0.011 mm. The proposed approach provides a basis for automated robotic seam tracking and intelligent precision manufacturing of welded ultrathin-sheet components in aerospace and other fields.
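As a minimal sketch of how the two reported error figures could be computed from a sequence of torch offset predictions, the Python snippet below derives a maximum absolute error and a mean signed error and checks them against the ±0.08 mm and ±0.011 mm bounds. The function names, the reading of "average error" as a signed mean, and the sample offsets are assumptions made for illustration only; they are not taken from the paper or its data.

```python
from statistics import fmean


def offset_error_summary(predicted_mm, measured_mm):
    """Summarize torch offset prediction errors (all values in millimetres).

    Hypothetical helper: the error definitions here are assumptions,
    not the evaluation protocol of the paper.
    """
    if len(predicted_mm) != len(measured_mm) or not predicted_mm:
        raise ValueError("inputs must be non-empty and of equal length")
    # Signed error per frame: prediction minus reference offset.
    errors = [p - m for p, m in zip(predicted_mm, measured_mm)]
    return {
        "max_abs_error": max(abs(e) for e in errors),
        "mean_error": fmean(errors),
    }


def within_tolerance(summary, max_tol_mm, mean_tol_mm):
    """Return True if both error figures fall inside the given bounds."""
    return (
        summary["max_abs_error"] <= max_tol_mm
        and abs(summary["mean_error"]) <= mean_tol_mm
    )


if __name__ == "__main__":
    # Illustrative offsets only; these are not measurements from the paper.
    predicted = [0.10, -0.05, 0.02]
    measured = [0.08, -0.02, 0.02]
    summary = offset_error_summary(predicted, measured)
    print(summary)
    print(within_tolerance(summary, max_tol_mm=0.08, mean_tol_mm=0.011))
```

Keeping the mean signed rather than absolute lets positive and negative deviations cancel, which is one plausible reading of a bound stated as "±"; a mean absolute error would be a stricter alternative.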