With the rapid advancement of deep learning, Transformer-based attention networks have shown promising performance in keyword spotting (KWS). However, these models suffer from high computational cost, owing to the large number of parameters in the Transformer and the burden of global attention, which limits their applicability in resource-constrained KWS scenarios. To overcome this issue, we propose a novel Swin-Transformer-based KWS method. In this approach, dynamic features are first extracted from the input Mel-Frequency Cepstral Coefficients (MFCCs) by a Temporal Convolutional Network (TCN). The Swin-Transformer is then employed to capture hierarchical multi-scale features, in which a window attention mechanism captures dynamic time–frequency features. Furthermore, a frame-level shifted window attention mechanism is proposed to strengthen inter-window interaction and thereby extract more contextual information from the spectrogram. Experimental results on the Speech Commands V1 dataset verify the effectiveness of the proposed method, which achieves a recognition accuracy of 98.01% with fewer model parameters, outperforming existing KWS methods.
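To make the described pipeline concrete, the following is a minimal sketch, not the authors' implementation: MFCC frames pass through a dilated TCN front-end, then through window attention over fixed-length frame windows, with an optional half-window cyclic shift so that adjacent windows can interact. All names (TCNBlock, ShiftedWindowAttention), the window size of 8 frames, 4 attention heads, and the 12 output classes are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """Dilated 1-D convolution over the time axis of the MFCC sequence (assumed design)."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.act = nn.ReLU()

    def forward(self, x):                    # x: (batch, channels, time)
        return self.act(self.conv(x)) + x    # residual connection

class ShiftedWindowAttention(nn.Module):
    """Self-attention inside fixed-length frame windows; a cyclic shift of half a
    window lets neighbouring windows exchange context (Swin-style, simplified)."""
    def __init__(self, dim, window, heads=4, shift=False):
        super().__init__()
        self.window, self.shift = window, shift
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                    # x: (batch, time, dim); time divisible by window
        b, t, d = x.shape
        s = self.window // 2 if self.shift else 0
        if s:
            x = torch.roll(x, -s, dims=1)    # frame-level shift before windowing
        x = x.reshape(b * (t // self.window), self.window, d)
        x, _ = self.attn(x, x, x)            # attention restricted to each window
        x = x.reshape(b, t, d)
        if s:
            x = torch.roll(x, s, dims=1)     # undo the shift
        return x

# Toy usage: batch of 8 utterances, 40 MFCC bins, 64 frames, 12 keyword classes (all assumed).
mfcc = torch.randn(8, 40, 64)
tcn = nn.Sequential(TCNBlock(40, 1), TCNBlock(40, 2))
feats = tcn(mfcc).transpose(1, 2)            # (batch, time, dim)
blocks = nn.Sequential(ShiftedWindowAttention(40, 8, shift=False),
                       ShiftedWindowAttention(40, 8, shift=True))
logits = nn.Linear(40, 12)(blocks(feats).mean(dim=1))
print(logits.shape)                          # torch.Size([8, 12])
```

Alternating a non-shifted and a shifted block, as in the toy usage above, is what allows information to flow across window boundaries while keeping attention cost linear in the number of frames.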