In recent years, video Internet of Things (IoT) deployments in cities and public places have brought unprecedented opportunities to the security field and achieved great success. However, recent research shows that video recognition models are also vulnerable to adversarial examples. Adversarial examples based on physical attacks, in turn, are easily noticed by humans and are therefore unlikely to pass human review. To address this problem, we propose a novel Multi-granular Spatio-temporal Attention Network (MSANet) that attacks video action recognition models imperceptibly. Specifically, to exploit video motion information more effectively and to reduce the detectability of the attack perturbations, we design a multiplexed spatio-temporal attention module that selects and enhances spatial regions and temporal frames at coarse-grained and fine-grained levels, respectively. This maintains a certain degree of smoothness while reducing the perturbation size and avoiding overfitting of the attack. In addition, MSANet achieves imperceptible perturbations of video sequences through alternating iterative optimization combined with the projected gradient descent (PGD) attack mechanism. Extensive experiments on two models (TDN and TSM) and two widely used datasets (HMDB-51 and UCF-101), with comparisons against the state-of-the-art model, demonstrate the effectiveness of the proposed attack on video action recognition.
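
For orientation, the PGD mechanism mentioned above can be written as a masked update on the perturbation. The following is only a sketch of the standard untargeted $\ell_\infty$ PGD step with a generic mask added; it is not the formulation used by MSANet. The symbols are assumptions introduced here for illustration: $x$ is the clean video, $y$ its label, $f$ the recognition model, $\mathcal{L}$ the classification loss, $\alpha$ the step size, $\epsilon$ the perturbation budget, and $M = M_s \odot M_t$ a hypothetical mask combining a spatial-region weight $M_s$ and a temporal-frame weight $M_t$, with entries in $[0, 1]$.

```latex
% Standard untargeted L_inf PGD step, restricted by a hypothetical
% spatio-temporal mask M (an assumption, not the paper's definition).
\begin{equation}
  \delta^{(k+1)}
  = \Pi_{\|\delta\|_\infty \le \epsilon}
    \Big(
      \delta^{(k)}
      + \alpha \, M \odot
        \operatorname{sign}\!\big(
          \nabla_{\delta}\,
          \mathcal{L}\big(f(x + M \odot \delta^{(k)}),\, y\big)
        \big)
    \Big),
  \qquad
  M = M_s \odot M_t .
\end{equation}
% \Pi denotes projection onto the L_inf ball of radius \epsilon,
% i.e. element-wise clipping of \delta to [-\epsilon, \epsilon].
```

Under these assumptions, entries of $M$ close to zero leave the corresponding pixels and frames essentially unperturbed, which is one way a selection over regions and frames could reduce the overall perturbation size.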