Image harmonization seeks to transfer the illumination distribution of the background to that of the foreground within a composite image. Existing methods lack the ability of establishing global–local pixel illumination dependencies between foreground and background of composite images, which is indispensable for sharp and color-consistent harmonized image generation. To overcome this challenge, we design a novel Simple Hybrid CNN-Transformer Network (SHT-Net), which is formulated into an efficient symmetrical hierarchical architecture. It is composed of two newly designed light-weight Transformer blocks. Firstly, the scale-aware gated block is designed to capture multi-scale features through different heads and expand the receptive fields, which facilitates to generate images with fine-grained details. Secondly, we introduce a simple parallel attention block, which integrates the window-based self-attention and gated channel attention in parallel, resulting in simultaneously global–local pixel illumination relationship modeling capability. Besides, we propose an efficient simple feed forward network to filter out less informative features and allow the features to contribute to generating photo-realistic harmonized results passing through. Extensive experiments on image harmonization benchmarks indicate that our method achieve promising quantitative and qualitative results. The code and pre-trained models are available at https://github.com/guanguanboy/SHT-Net.