Proton therapy is a form of radiotherapy commonly used to treat various cancers. Because of its high conformality, minor variations in patient anatomy can lead to significant alterations in dose distribution, making adaptation crucial. While cone-beam computed tomography (CBCT) is a well-established technique for adaptive radiation therapy (ART), it cannot be used directly for adaptive proton therapy (APT) because the stopping power ratio (SPR) cannot be estimated from CBCT images. To address this limitation, deep learning methods have been suggested for generating pseudo-CT (pCT) images from CBCT images. Although convolutional neural networks (CNNs) have shown consistent improvement in the pCT literature, further enhancements are still needed to make them suitable for clinical applications. We introduce a 3D vision transformer (ViT) block and study its performance at various stages of the proposed architectures. We also conduct a retrospective analysis of a dataset of 259 image pairs from 59 patients who underwent treatment for head and neck cancer. The dataset is partitioned into 80% for training, 10% for validation, and 10% for testing. The SPR maps obtained from the pCT using the proposed method present an absolute relative error of less than 5% with respect to those computed from the planning CT, thus improving on the results obtained with CBCT. In summary, the enhanced ViT3D architecture for pCT image generation from CBCT images brings the SPR error within clinical margins for APT workflows. The new method minimizes bias relative to CT-based SPR estimation and dose calculation, signaling a promising direction for future research in this field. However, further research is needed to assess its robustness and generalizability across different medical imaging applications.
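To make the reported SPR metric concrete, the following is a minimal sketch of a voxel-wise absolute relative error between a reference SPR map (from the planning CT) and a test SPR map (from the pCT). It is an illustration only, not the evaluation code used in this work: the function names, the flattened-list representation of the maps, and the `min_ref` cutoff for skipping near-zero reference voxels (such as air) are all assumptions.

```python
from typing import List, Sequence


def abs_relative_error_percent(
    spr_ref: Sequence[float],
    spr_test: Sequence[float],
    min_ref: float = 1e-6,  # hypothetical cutoff: skip voxels with ~0 reference SPR
) -> List[float]:
    """Return |test - ref| / |ref| * 100 for every voxel with a usable reference."""
    if len(spr_ref) != len(spr_test):
        raise ValueError("SPR maps must contain the same number of voxels")
    return [
        abs(t - r) / abs(r) * 100.0
        for r, t in zip(spr_ref, spr_test)
        if abs(r) > min_ref
    ]


def mean_abs_relative_error_percent(
    spr_ref: Sequence[float],
    spr_test: Sequence[float],
    min_ref: float = 1e-6,
) -> float:
    """Average the per-voxel absolute relative error over the usable voxels."""
    errors = abs_relative_error_percent(spr_ref, spr_test, min_ref)
    if not errors:
        raise ValueError("no voxel has a reference SPR above min_ref")
    return sum(errors) / len(errors)


# Toy flattened maps (made-up values, not data from the study).
reference_spr = [1.00, 1.05, 0.0, 1.60]
pseudo_ct_spr = [1.02, 1.03, 0.0, 1.52]

per_voxel = abs_relative_error_percent(reference_spr, pseudo_ct_spr)
mean_error = mean_abs_relative_error_percent(reference_spr, pseudo_ct_spr)
```

In this toy example the third voxel is excluded because its reference SPR is zero, so the relative error would be undefined there. How such voxels are handled in practice (body mask, threshold, or otherwise) is a design choice the sketch does not settle.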