Action recognition has advanced significantly through spatio-temporal representations, particularly skeleton-based models and cross-modal data fusion. However, existing approaches struggle to capture long-range dependencies within the human body skeleton and to balance features from different modalities. To address these limitations, a novel framework, the Dynamic Spatio-Temporal Graph Attention Transformer (D-STGAT), is proposed, which combines dynamic graph attention mechanisms with transformer architectures for action recognition. The framework builds on recent work on graph attention networks (GAT) and transformer models. First, the Spatial-Temporal Dynamic Graph Attention Network (ST-DGAT) is introduced, extending traditional GAT with a dynamic attention mechanism that captures spatial-temporal patterns within skeleton sequences. By reordering the weighted vector operations in GAT, the approach obtains a global approximate attention function, which is more expressive than static attention and captures long-distance dependencies more effectively. Second, to address cross-modal feature representation and fusion, the Spatio-Temporal Cross Attention Transformer (ST-CAT) is introduced. This model integrates spatio-temporal information from both video frames and skeleton sequences by combining full spatio-temporal attention (FAttn), zigzag spatio-temporal attention (ZAttn), and binary spatio-temporal attention (BAttn) modules. By arranging these modules appropriately within the transformer encoder and decoder, ST-CAT learns a multi-feature representation that captures the intricate spatio-temporal dynamics of actions.
Experimental results on the Penn-Action, NTU-RGB+D 60, and NTU-RGB+D 120 datasets demonstrate the efficacy of the approach, with performance improvements over previous state-of-the-art methods. In summary, the proposed D-STGAT and ST-CAT frameworks use dynamic graph attention mechanisms and transformer architectures to capture and fuse spatio-temporal features from different modalities.
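To illustrate what reordering the operations in GAT can mean, the following minimal sketch contrasts a static GAT score with a dynamic score in which the nonlinearity is applied before the attention vector. It assumes a GATv2-style formulation, which may differ from the exact ST-DGAT layer. All function names, weights, and feature values are hypothetical and chosen only for illustration.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def matvec(W, v):
    # Multiply matrix W (list of rows) by vector v.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def static_score(a, W, h_i, h_j):
    # Static (original GAT): LeakyReLU(a^T [W h_i || W h_j]).
    # The ranking of neighbors j is the same for every query node i.
    return leaky_relu(dot(a, matvec(W, h_i) + matvec(W, h_j)))

def dynamic_score(a, W, h_i, h_j):
    # Dynamic (assumed GATv2-style reordering): a^T LeakyReLU(W [h_i || h_j]).
    # The nonlinearity sits between W and a, so the ranking can depend on i.
    return dot(a, [leaky_relu(z) for z in matvec(W, h_i + h_j)])

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(score_fn, a, W, h, i, neighbors):
    # Normalized attention of query joint i over the given neighbor joints.
    return softmax([score_fn(a, W, h[i], h[j]) for j in neighbors])

# Toy 1-D features for three joints (illustrative values only).
h = [[0.0], [1.0], [2.0]]
W_dyn = [[1.0, -1.0], [-1.0, 1.0]]  # maps [h_i || h_j] to two hidden units
a_dyn = [-1.0, -1.0]
W_stat = [[1.0]]                    # maps each 1-D feature to one hidden unit
a_stat = [0.3, 1.0]

for i in range(3):
    print("static ", i, attention_weights(static_score, a_stat, W_stat, h, i, [0, 1, 2]))
    print("dynamic", i, attention_weights(dynamic_score, a_dyn, W_dyn, h, i, [0, 1, 2]))
```

With these toy weights, the dynamic score reduces to a penalty on the distance between the two features, so each query joint attends most to the neighbor closest to itself. The static score, by contrast, always prefers the same neighbor regardless of the query.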