Diffusion models (DM), built from a hierarchy of denoising autoencoders, have achieved remarkable progress in image generation and are increasingly influential in image restoration (IR) tasks. In the meantime, their denoising backbones have also evolved from UNet to vision transformers, e.g., Restormer. It is therefore important to disentangle the contribution of the backbone network from that of the additional generative learning scheme. Notably, DM shows varied performance across IR tasks, and the performance of recent advanced transformer-based DM on PET denoising remains under-explored. In this study, we further raise an intuitive question: "if we have a sufficiently powerful backbone, can DM serve as a general add-on generative learning scheme to further boost PET denoising?" Specifically, we investigate one of the best-in-class IR models, i.e., DiffIR, a latent DM built on the Restormer backbone. We provide a qualitative and quantitative comparison with UNet, SR3 (UNet + pixel-space DM), and Restormer on the 25% low-dose 18F-FDG whole-body PET denoising task, aiming to identify best practices. We trained and tested on 93 and 12 subjects, respectively, each with 644 slices. Restormer outperforms UNet in terms of PSNR and MSE. However, adding a latent DM on top of Restormer does not improve MSE, SSIM, or PSNR in our task, and even falls below the conventional UNet. In addition, SR3, with its pixel-space DM, does not stably synthesize satisfactory results. These results are consistent with findings in natural-image super-resolution, which likewise suffers from limited spatial information. A possible reason is that denoising iterations in the latent feature space cannot adequately support the restoration of fine structure and texture. This issue is more crucial in IR tasks whose inputs carry limited detail, e.g., SR and PET denoising.
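The quantitative comparison above rests on MSE and PSNR (alongside SSIM). As a minimal NumPy-only sketch of how these two metrics are typically computed per slice (function names and the toy arrays are illustrative, not from the study; in practice a library such as scikit-image would be used, and SSIM requires a windowed computation omitted here):

```python
import numpy as np

def mse(reference: np.ndarray, prediction: np.ndarray) -> float:
    """Mean squared error between a full-dose reference slice and a denoised prediction."""
    diff = reference.astype(np.float64) - prediction.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(reference: np.ndarray, prediction: np.ndarray, data_range: float) -> float:
    """Peak signal-to-noise ratio in dB; data_range is the maximum possible intensity."""
    err = mse(reference, prediction)
    if err == 0.0:
        return float("inf")  # identical images
    return 10.0 * float(np.log10((data_range ** 2) / err))

# Toy example on a synthetic "slice" (not real PET data):
ref = np.ones((4, 4))
pred = np.full((4, 4), 0.5)
print(mse(ref, pred))                              # 0.25
print(round(psnr(ref, pred, data_range=1.0), 3))   # 6.021
```

Lower MSE and higher PSNR indicate a prediction closer to the full-dose reference, which is the sense in which Restormer outperforms UNet here.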