Character reenactment aims to control a target person's full-head movement with a monocular video of a driving character. Current algorithms employ convolutional neural networks within generative adversarial networks, extracting historical and geometric information to iteratively generate video frames. However, convolutional neural networks capture only local information within limited receptive fields and ignore the global dependencies that are crucial for face synthesis, leading to unnatural video frames. In this work, we design a progressive transformer module that introduces multi-head self-attention with convolution refinement to simultaneously capture global-local dependencies. Specifically, we adopt non-overlapping window-based multi-head self-attention in a hierarchical architecture to obtain larger receptive fields on low-resolution feature maps and thus extract global information. To better model local dependencies, we introduce a convolution operation that further refines the attention weights within the multi-head self-attention mechanism. Finally, we stack several progressive transformer modules with down-sampling operations to encode the appearance information of previously generated frames and the parameterized 3D face information of the current frame. Similarly, we stack several progressive transformer modules with up-sampling operations to iteratively generate video frames. In this way, the model captures global-local information and generates video frames that are globally natural while preserving sharp outlines and rich details. Extensive experiments on several standard benchmarks suggest that the proposed method outperforms current leading algorithms.
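
To make the described mechanism concrete, the following is a minimal sketch, not the authors' implementation, of one building block: non-overlapping window-based multi-head self-attention whose attention maps are refined by a depth-wise convolution before the softmax. All names and hyper-parameters (ConvRefinedWindowAttention, window_size, num_heads) are illustrative assumptions; the paper's exact refinement operator, hierarchy, and down-/up-sampling schedule may differ.

```python
import torch
import torch.nn as nn


class ConvRefinedWindowAttention(nn.Module):
    """Window-based multi-head self-attention with convolutional refinement (sketch)."""

    def __init__(self, dim, window_size=8, num_heads=4):
        super().__init__()
        self.window_size = window_size
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        # Depth-wise conv that locally refines each head's attention map
        # (assumed form of the refinement; the paper's operator may differ).
        self.attn_refine = nn.Conv2d(num_heads, num_heads, kernel_size=3,
                                     padding=1, groups=num_heads)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, H, W, C) with H and W divisible by the window size
        B, H, W, C = x.shape
        ws, nh = self.window_size, self.num_heads
        # Partition into non-overlapping windows -> (B * num_windows, ws*ws, C)
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
        Bw, N, _ = x.shape
        qkv = self.qkv(x).reshape(Bw, N, 3, nh, C // nh).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]                 # (Bw, nh, N, C/nh)
        attn = (q @ k.transpose(-2, -1)) * self.scale    # (Bw, nh, N, N)
        attn = attn + self.attn_refine(attn)             # convolutional refinement
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(Bw, N, C)
        out = self.proj(out)
        # Merge windows back to (B, H, W, C)
        out = out.view(B, H // ws, W // ws, ws, ws, C)
        return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)


if __name__ == "__main__":
    block = ConvRefinedWindowAttention(dim=64, window_size=8, num_heads=4)
    feat = torch.randn(1, 32, 32, 64)   # a low-resolution feature map
    print(block(feat).shape)            # torch.Size([1, 32, 32, 64])
```

Restricting attention to non-overlapping windows keeps the cost quadratic only in the window size, while the hierarchical down-sampling described above lets low-resolution stages cover the whole face and thus capture global dependencies.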