Slip detection plays a crucial role in robotic operations and has received increasing attention in robotics. We introduce VT-VT, a slip detection model that fuses vision and touch with a transformer architecture. The model leverages a 'divided spatial-temporal attention' mechanism, which captures global contextual information effectively and is highly sensitive to the temporal cues that arise during slipping. To validate VT-VT, we conducted experiments on a public slip detection dataset, achieving a test accuracy of up to 90.52% and significantly outperforming convolutional neural network–long short-term memory (CNN-LSTM) models built on different feature extraction networks. We further examined how patch size, sensing modality, and lighting conditions affect the performance of VT-VT, and analyzed the model's real-time capability.
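To make the central mechanism concrete, the sketch below illustrates one transformer block with divided spatial-temporal attention in the TimeSformer style: temporal self-attention across frames at each patch location, followed by spatial self-attention across patches within each frame. This is a minimal illustrative example, not the VT-VT implementation; the class name, embedding dimension, head count, and pre-norm layout are assumptions for demonstration.

```python
import torch
import torch.nn as nn


class DividedSpaceTimeBlock(nn.Module):
    """Illustrative transformer block with divided spatial-temporal attention:
    temporal attention over the T frames at each patch location, then spatial
    attention over the P patches within each frame (hypothetical sketch)."""

    def __init__(self, dim: int = 192, heads: int = 3):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_m = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim) -- patch tokens of a visual/tactile clip
        b, t, p, d = x.shape

        # Temporal attention: each patch location attends over the T frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        h = self.norm_t(xt)
        xt = xt + self.attn_t(h, h, h, need_weights=False)[0]
        x = xt.reshape(b, p, t, d).permute(0, 2, 1, 3)

        # Spatial attention: each frame attends over its P patches.
        xs = x.reshape(b * t, p, d)
        h = self.norm_s(xs)
        xs = xs + self.attn_s(h, h, h, need_weights=False)[0]
        x = xs.reshape(b, t, p, d)

        # Position-wise feed-forward network with residual connection.
        return x + self.mlp(self.norm_m(x))


# Example: 8-frame clip, 14 x 14 = 196 patches per frame, embedding dim 192.
tokens = torch.randn(2, 8, 196, 192)
print(DividedSpaceTimeBlock()(tokens).shape)  # torch.Size([2, 8, 196, 192])
```

Factoring full spatio-temporal attention into these two cheaper passes reduces the cost per block from attending over T x P tokens jointly to attending over T tokens and P tokens separately, which is what makes the mechanism practical for frame sequences.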