The rise of Deepfake technology has sparked significant concern over its potential for misuse and the malicious manipulation of multimedia content. Various approaches for detecting Deepfake videos have been proposed, most of which rely on identifying spatial and temporal artifacts. However, owing to the varied contexts of source images and the diversity of generation techniques, current Deepfake detection methods typically perform well on their training datasets yet generalize poorly to unseen identities in new datasets. This issue is widely known as the generalization challenge of Deepfake detection. To address it, this paper proposes an advanced spatiotemporal Deepfake video detector, the Motion-enhanced Spatiotemporal Transformer (MeST-Former). MeST-Former builds on the spatiotemporal modelling capacity of the video Swin Transformer, extracting spatial features from RGB images and temporal features from motion images. To improve its generalization to unseen identities in unseen datasets, MeST-Former removes the identity-related (ID-related) components from the spatial and temporal features. Specifically, it adopts the newly proposed Identity-Decoupling Attention (IDC-Att) module to disentangle the ID-related and ID-unrelated components, and only the ID-unrelated components are used to construct the spatiotemporal representations, making them identity-agnostic and more generalizable to unseen identities. We conducted extensive experiments to evaluate MeST-Former, and the results indicate that it achieves accurate and generalizable Deepfake detection. Notably, MeST-Former also demonstrates high efficacy in detecting AI-animated talking-head videos.
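To make the decoupling idea concrete, the sketch below shows one way an identity-decoupling attention step could be realized in PyTorch. The abstract does not specify IDC-Att's internals, so this is a minimal illustration under stated assumptions, not the authors' implementation: the two-projection split, the orthogonality penalty, and all names (`IDCAttSketch`, `id_proj`, `content_proj`) are hypothetical.

```python
# Illustrative sketch only: one plausible instantiation of identity
# decoupling, NOT the paper's IDC-Att module. All names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class IDCAttSketch(nn.Module):
    """Splits token features into ID-related and ID-unrelated components
    via two learned projections, then attends only over the ID-unrelated
    part. A cosine-orthogonality penalty (a common disentanglement
    heuristic, assumed here) discourages identity leakage."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.id_proj = nn.Linear(dim, dim)       # hypothetical ID-related branch
        self.content_proj = nn.Linear(dim, dim)  # hypothetical ID-unrelated branch
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor):
        # x: (batch, tokens, dim) spatial or temporal token features
        id_part = self.id_proj(x)       # identity-related component (discarded downstream)
        content = self.content_proj(x)  # identity-unrelated component
        # Self-attention restricted to the ID-unrelated component, so the
        # resulting spatiotemporal representation stays identity-agnostic.
        out, _ = self.attn(content, content, content)
        # Orthogonality penalty pushing the two components apart.
        ortho_loss = F.cosine_similarity(id_part, content, dim=-1).pow(2).mean()
        return out, ortho_loss


if __name__ == "__main__":
    # Dummy token features, e.g. from the RGB (spatial) or motion (temporal) stream.
    feats = torch.randn(2, 49, 256)  # (batch, tokens, channels)
    decoupled, loss = IDCAttSketch(dim=256)(feats)
    print(decoupled.shape, loss.item())
```

In this sketch, the orthogonality loss would be added to the detection objective as an auxiliary term; the actual disentanglement mechanism and training losses of IDC-Att are detailed in the paper's method section.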