PET/CT scanners typically use CT images for PET attenuation correction, which introduces additional radiation exposure. Conversely, standalone PET systems cannot perform attenuation and scatter correction because no CT images are available. It is therefore necessary to explore methods for generating pseudo-CT images from PET images. However, traditional PET-to-CT synthesis models suffer from conflicts in multi-objective optimization, producing disparities between synthetic and real images in both overall structure and texture. To address this issue, we propose a staged image generation model. First, we construct a dual-stage generator that synthesizes the overall structure and the texture details of images separately, decomposing the optimization objectives and applying constraints from multiple loss functions. Second, within each generator we employ improved deep perceptual skip connections, which use cross-layer information interaction and deep perceptual selection to exploit multi-level deep features selectively while avoiding interference from redundant information. Finally, we construct a context-aware local discriminator that integrates contextual information with local feature extraction, encouraging fine local detail while maintaining the overall coherence of the generated images. Experimental results demonstrate that our approach outperforms competing methods, reaching an SSIM of 0.8993, a PSNR of 29.6108 dB, and an FID of 29.7489, achieving state-of-the-art performance. Furthermore, qualitative evaluation of the synthesized pseudo-CT images with respect to structure and texture shows that they are closer to real CT images, providing accurate structural information for clinical disease analysis and lesion localization.
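To make the staged design concrete, the sketch below shows a minimal two-stage generator in PyTorch: the first stage drafts the overall structure from PET, the second refines texture from the PET input concatenated with the coarse draft, and a learned 1x1 gate on the skip connection stands in for deep perceptual selection. All module names, channel widths, and loss weights (`GatedStage`, `StagedGenerator`, `lam`) are illustrative assumptions for exposition, not the exact architecture proposed in this work.

```python
# Minimal sketch of a staged PET-to-CT generator, under assumed layer sizes.
import torch
import torch.nn as nn


class GatedStage(nn.Module):
    """One generator stage: small encoder-decoder with a gated skip connection."""

    def __init__(self, in_ch: int, out_ch: int, width: int = 32):
        super().__init__()
        self.enc1 = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(inplace=True))
        self.enc2 = nn.Sequential(
            nn.Conv2d(width, width * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.dec1 = nn.Sequential(
            nn.ConvTranspose2d(width * 2, width, 4, stride=2, padding=1),
            nn.ReLU(inplace=True))
        # 1x1 gate standing in for "deep perceptual selection": it decides, per
        # channel and pixel, how much shallow detail the skip passes through,
        # rather than forwarding all shallow features indiscriminately.
        self.gate = nn.Sequential(nn.Conv2d(width, width, 1), nn.Sigmoid())
        self.out = nn.Conv2d(width, out_ch, 3, padding=1)

    def forward(self, x):
        s = self.enc1(x)                 # shallow feature at full resolution
        d = self.dec1(self.enc2(s))      # deep feature, upsampled back
        return self.out(d + self.gate(s) * s)  # selectively gated skip


class StagedGenerator(nn.Module):
    """Stage 1 drafts the overall structure; stage 2 refines texture on top."""

    def __init__(self):
        super().__init__()
        self.structure = GatedStage(in_ch=1, out_ch=1)  # PET -> coarse CT
        self.texture = GatedStage(in_ch=2, out_ch=1)    # PET + coarse -> refined CT

    def forward(self, pet):
        coarse = self.structure(pet)
        refined = self.texture(torch.cat([pet, coarse], dim=1))
        return coarse, refined


def generator_loss(coarse, refined, ct, lam: float = 10.0):
    # Decomposed objectives: one term constrains the structural draft, another
    # the refined texture output (the weighting lam is an assumption).
    return (nn.functional.l1_loss(coarse, ct)
            + lam * nn.functional.l1_loss(refined, ct))


if __name__ == "__main__":
    pet = torch.randn(2, 1, 64, 64)  # toy PET batch
    ct = torch.randn(2, 1, 64, 64)   # paired ground-truth CT
    g = StagedGenerator()
    coarse, refined = g(pet)
    generator_loss(coarse, refined, ct).backward()
```

Splitting the objective across the two stages is what lets each stage be optimized toward a single goal (structure, then texture) instead of forcing one network to balance conflicting losses; in the full model the refined output would additionally be scored by the context-aware local discriminator.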