Computed tomography (CT) denoising is a challenging task in medical imaging that has attracted considerable attention. Supervised networks require large numbers of paired noisy and clean images, which are often unavailable in clinical settings. Existing self-supervised algorithms that suppress noise using paired similar noisy images have limitations, such as ignoring the residual between the similar images during training and insufficiently learning the spectral information of images. In this study, we propose a Residual Image Prior Network (RIP-Net) that explicitly models the residual between paired similar noisy images. Our approach addresses these limitations in three ways. First, we establish a mathematical theorem clarifying the non-equivalence between similar-image-based self-supervised learning and supervised learning, which helps delineate the strengths and limitations of self-supervised learning. Second, we introduce a novel regularization term that models a low-frequency residual image prior, which can improve the accuracy and robustness of the model. Third, we design a denoising network that explores spectral information while simultaneously capturing contextual information: the network has dual paths for modeling the high- and low-frequency components of the raw noisy image, and context perception modules capture local and global interactions to produce high-quality images. Comprehensive experiments on preclinical photon-counting CT, clinical brain CT, and low-dose CT datasets demonstrate that RIP-Net outperforms other unsupervised denoising methods.
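As a rough illustration of two ideas named above, the sketch below splits an image into low- and high-frequency components and penalizes high-frequency content in an estimated residual between two similar images. It is not the RIP-Net implementation: the abstract does not specify the form of the regularization term or the network, so the ideal FFT low-pass mask, the `cutoff` and `lam` parameters, and all function names here are assumptions made only for this example.

```python
import numpy as np


def split_frequencies(image, cutoff=0.1):
    """Split a 2-D image into low- and high-frequency components.

    Assumption: an ideal circular low-pass mask in the FFT domain, with
    `cutoff` in cycles per pixel. The paper may use a different split.
    """
    fy = np.fft.fftfreq(image.shape[0])[:, None]
    fx = np.fft.fftfreq(image.shape[1])[None, :]
    mask = np.sqrt(fx ** 2 + fy ** 2) <= cutoff
    low = np.fft.ifft2(np.fft.fft2(image) * mask).real
    high = image - low
    return low, high


def high_freq_penalty(residual, cutoff=0.1):
    """Mean squared high-frequency energy of a residual image.

    Hypothetical stand-in for a 'low-frequency residual image prior':
    the value is small when the residual is smooth.
    """
    _, high = split_frequencies(residual, cutoff)
    return float(np.mean(high ** 2))


def regularized_pair_loss(pred_a, noisy_b, residual, lam=1.0, cutoff=0.1):
    """Hypothetical objective for a pair of similar noisy images.

    The data term compares the denoised image A, shifted by the estimated
    residual, with noisy image B; the prior term discourages
    high-frequency structure in that residual.
    """
    data_term = float(np.mean((pred_a + residual - noisy_b) ** 2))
    return data_term + lam * high_freq_penalty(residual, cutoff)
```

In this toy form, a residual that is a constant intensity offset between the two images incurs essentially no penalty, whereas a pixel-level checkerboard residual is penalized in full.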