Abstract

In some proton therapy facilities, patient alignment relies on two orthogonal 2D kV images taken at fixed, oblique angles, since no 3D on-the-bed imaging is available. The visibility of the tumor in kV images is limited because the patient's 3D anatomy is projected onto a 2D plane, especially when the tumor lies behind a high-density structure such as bone. This can lead to large patient setup errors. A solution to this problem is to reconstruct the 3D CT image from the kV images obtained in the treatment position. An asymmetric autoencoder-like network built with vision-transformer blocks was developed. The data were collected from one head-and-neck patient: 2 orthogonal kV images (1024×1024 pixels), one 3D CT with padding (512×512×512 voxels) acquired from the in-room CT-on-rails before the kV images were taken, and 2 digitally reconstructed radiograph (DRR) images (512×512 pixels) computed from the CT. We resampled the kV images every 8 pixels and the DRR and CT images every 4 pixels/voxels, forming a dataset of 262,144 samples in which each image had a dimension of 128 in every direction. Each CT voxel value was normalized to the range 0-1 with a uniform shift of 1000 and a denominator of 4000. For the kV and DRR images, we ranked all pixel values in ascending order and normalized the lowest 80% to the range 0-0.8 and the remainder to the range 0.8-1, yielding a quasi-Gaussian distribution that is favorable for deep neural networks. We further cropped the kV and DRR images with a self-supervised bitmap derived from the pixel gradients. During training, both kV and DRR images were used, and the encoder was encouraged to learn the same feature maps for the kV images and their corresponding DRR images, with mean absolute error (MAE) as the similarity loss. The decoder then reconstructed the 3D CT image from the feature maps of the kV images, with the CT-on-rails image as ground truth (gCT) and MAE as the reconstruction loss. During testing, only independent kV images were used. The full-size synthetic CT (sCT) was obtained by concatenating the sCT patches generated by the model according to their spatial information. The image quality of the sCT was evaluated using the MAE and a per-voxel absolute CT-number-difference volume histogram (CDVH). The network was implemented with the PyTorch deep learning library, and both distributed data parallel (DDP) and automatic mixed precision (AMP) were applied to save memory and accelerate training. We used the AdamW optimizer with β1 = 0.9 and β2 = 0.999 and a cosine annealing learning-rate scheduler with an initial learning rate of 1e-7 and 20 warm-up epochs. The model achieved an MAE of <40 HU, and the CDVH showed that <5% of the voxels had a per-voxel absolute CT-number difference larger than 185 HU. The profile of a typical gCT slice and its corresponding sCT slice showed high agreement, indicating high similarity between the gCT and sCT. A patient-specific vision-transformer-based network was developed and shown to be accurate and efficient for reconstructing 3D CT images from kV images.
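
To make the two normalization rules concrete, here is a minimal sketch assuming NumPy arrays holding the raw CT numbers (in HU) and the raw kV/DRR intensities. The piecewise-linear mapping around the 80th-percentile value is one plausible reading of the abstract's description; the function names and clipping behavior are assumptions, not the authors' code.

```python
import numpy as np

def normalize_ct(ct_hu: np.ndarray) -> np.ndarray:
    """Map CT numbers to [0, 1] with a uniform shift of 1000 and a denominator of 4000."""
    return np.clip((ct_hu + 1000.0) / 4000.0, 0.0, 1.0)

def normalize_projection(img: np.ndarray, split: float = 0.8) -> np.ndarray:
    """Normalize a kV/DRR image so that the lowest `split` fraction of values
    maps to [0, split] and the remaining values map to [split, 1]."""
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    p = np.percentile(img, 100.0 * split)  # value separating the two groups
    out = np.empty_like(img)
    low = img <= p
    out[low] = split * (img[low] - lo) / max(p - lo, 1e-12)
    out[~low] = split + (1.0 - split) * (img[~low] - p) / max(hi - p, 1e-12)
    return out
```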
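The gradient-based cropping could look roughly like the following sketch: build a binary bitmap from the gradient magnitude and crop to its bounding box. The quantile threshold is an assumption; the abstract only states that the bitmap is derived from the pixel gradients in a self-supervised way.

```python
import numpy as np

def gradient_bitmap_crop(img: np.ndarray, threshold_quantile: float = 0.5):
    """Crop a 2D image to the bounding box of a gradient-magnitude bitmap."""
    gy, gx = np.gradient(img.astype(np.float64))
    grad_mag = np.hypot(gx, gy)
    mask = grad_mag > np.quantile(grad_mag, threshold_quantile)  # binary bitmap
    rows, cols = np.any(mask, axis=1), np.any(mask, axis=0)
    r0, r1 = np.argmax(rows), len(rows) - np.argmax(rows[::-1])
    c0, c1 = np.argmax(cols), len(cols) - np.argmax(cols[::-1])
    return img[r0:r1, c0:c1], mask
```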
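A condensed sketch of the training step described above: a shared encoder produces feature maps for the kV pair and the matching DRR pair, an MAE similarity loss pulls the two feature maps together, and an MAE reconstruction loss compares the decoded volume with the gCT patch. The tiny stand-in encoder/decoder, the equal loss weighting, the warm-up shape, and the total epoch count are assumptions, and DDP wrapping is omitted for brevity.

```python
import torch
import torch.nn as nn
from torch.cuda.amp import GradScaler, autocast

# Toy stand-ins for the asymmetric ViT-based encoder/decoder (the real architecture is not shown here).
encoder = nn.Conv2d(2, 8, kernel_size=3, padding=1)      # (B, 2, 128, 128) -> (B, 8, 128, 128)

class ToyDecoder(nn.Module):
    """Reshape the 2D feature map into a thin volume and upsample its depth from 8 to 128."""
    def __init__(self):
        super().__init__()
        self.up = nn.ConvTranspose3d(1, 1, kernel_size=(16, 1, 1), stride=(16, 1, 1))

    def forward(self, feat):                              # feat: (B, 8, 128, 128)
        return self.up(feat.unsqueeze(1))                 # -> (B, 1, 128, 128, 128)

decoder = ToyDecoder()
mae = nn.L1Loss()                                         # MAE, used for both losses

optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(decoder.parameters()),
    lr=1e-7, betas=(0.9, 0.999),
)
# 20 warm-up epochs followed by cosine annealing (warm-up shape and 200-epoch total assumed).
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=20)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=180)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine], milestones=[20])
scaler = GradScaler(enabled=torch.cuda.is_available())    # automatic mixed precision

def train_step(kv_pair, drr_pair, gct):
    """kv_pair, drr_pair: (B, 2, 128, 128) projections; gct: (B, 1, 128, 128, 128) CT patch."""
    optimizer.zero_grad(set_to_none=True)
    with autocast(enabled=torch.cuda.is_available()):
        feat_kv = encoder(kv_pair)
        feat_drr = encoder(drr_pair)
        similarity_loss = mae(feat_kv, feat_drr)          # encourage identical kV/DRR features
        sct = decoder(feat_kv)                            # reconstruct 3D CT from kV features
        reconstruction_loss = mae(sct, gct)
        loss = similarity_loss + reconstruction_loss      # equal weighting assumed
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```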
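Finally, a rough sketch of the two reported metrics, assuming gCT and sCT are HU-valued arrays of equal shape: the MAE is the mean per-voxel absolute CT-number difference, and the CDVH is read here as the fraction of voxels whose absolute difference exceeds each threshold (the exact CDVH convention is an assumption).

```python
import numpy as np

def mae_hu(gct: np.ndarray, sct: np.ndarray) -> float:
    """Mean absolute CT-number difference in HU between ground-truth and synthetic CT."""
    return float(np.mean(np.abs(gct - sct)))

def cdvh(gct: np.ndarray, sct: np.ndarray, thresholds_hu=np.arange(0, 501, 5)):
    """Per-voxel absolute CT-number-difference volume histogram:
    fraction of voxels whose |gCT - sCT| exceeds each threshold."""
    diff = np.abs(gct - sct).ravel()
    return [(float(t), float(np.mean(diff > t))) for t in thresholds_hu]

# Example check against the reported result: with MAE < 40 HU, the CDVH value
# at a 185 HU threshold should be below 0.05 (fewer than 5% of voxels).
```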
