Dual-energy CT can be used to optimize radiation treatment. Recently, deep learning has been demonstrated to synthesize high-energy CT images from low-energy ones, reducing dose and the burden on the CT system. As the state-of-the-art deep learning architecture, the Transformer has a computational cost that grows quadratically with the feature size, making model training resource-demanding or even infeasible. Here, we introduce an efficient Transformer that balances CT image synthesis quality against computational burden. The model is a U-shaped deep neural network whose encoders and decoders are built from Transformer blocks. The model input is a low-energy (100 kVp) CT image and the output is the corresponding high-energy (140 kVp) image. Each block contains a Self Channel Correlation Unit (SCCU) and a Self Spatial Attention Unit (SSAU), each wrapped with a local shortcut. Down-sampling implemented via pixel shuffling produces multi-scale feature maps, and a Transformer block is applied at each feature scale. Symmetric skip connections send features from shallow layers to deep layers, so an additional 1 × 1 convolution is used for feature fusion in each decoder. In a SCCU, the feature is first projected to a Query, a Key, and a Value tensor. The Query and Key tensors are then multiplied to compute the cross-covariance of the feature channels. Normalizing this covariance yields a channel correlation score, which is used to weight the Value tensor. As a result, the model complexity increases only linearly with the feature size. Beyond channel weighting, the SSAU enhances spatial information: the feature is mapped to two tensors, and one tensor, after activation, calibrates the other point-wise. Additional Transformer blocks are cascaded after the last decoder for feature refinement. Because low- and high-energy CT images are structurally similar, a global shortcut is used to ease model training.
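The abstract does not give implementation details for the two units, so the following is only a minimal NumPy sketch of the mechanisms described: channel cross-covariance attention (linear in spatial size, since the attention map is C × C rather than N × N) and an activation-gated point-wise calibration. All function and weight names are hypothetical, and the choice of sigmoid as the gating activation is an assumption not stated in the abstract.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_channel_correlation(x, w_q, w_k, w_v):
    """Sketch of an SCCU-style unit (names hypothetical).

    x: (C, N) feature map, C channels, N = H*W flattened spatial positions.
    w_q, w_k, w_v: (C, C) linear projections to Query, Key, Value.
    The attention matrix is C x C, so the cost is O(C^2 * N):
    linear, not quadratic, in the spatial feature size N.
    """
    q, k, v = w_q @ x, w_k @ x, w_v @ x          # each (C, N)
    cov = q @ k.T                                 # (C, C) channel cross-covariance
    score = softmax(cov / np.sqrt(x.shape[1]))    # normalized correlation score
    return score @ v                              # score weights the Value tensor

def self_spatial_attention(x, w_a, w_b):
    """Sketch of an SSAU-style unit: the feature is mapped to two tensors,
    and one branch, after activation (sigmoid assumed here), calibrates
    the other point-wise."""
    gate = 1.0 / (1.0 + np.exp(-(w_a @ x)))       # activated branch
    return gate * (w_b @ x)                       # point-wise calibration
```

In the described architecture, each unit would additionally carry a local shortcut (output added back to its input), omitted here for brevity.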
Clinical iodine contrast-enhanced dual-energy CT image datasets of 19 patients are used in this study. Dual-energy scanning is performed on a SOMATOM Definition Flash DECT scanner. We split the datasets into a training set of 15 patients, a validation set of 1 patient, and a testing set of 3 patients. The image size is 512 × 512 with a pixel size of 0.5 × 0.5 mm². The U-Net model, with 1.95M parameters and 44.87G FLOPS, achieved an average PSNR of 44.55 dB (s.d. 1.34) and an average RMSE of 0.0060 (s.d. 0.001). In comparison, our efficient Transformer, with 1.408M parameters and 31.375G FLOPS, achieved an average PSNR of 44.78 dB (s.d. 1.37) and an average RMSE of 0.0059 (s.d. 0.001), demonstrating that our model attains better performance with a smaller model size and less computation. The efficient Transformer model thus allows high-resolution CT image synthesis from low-energy CT images at a small model scale and computational burden.
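The reported RMSE values near 0.006 alongside PSNR values near 44.5 dB are consistent with intensities normalized to [0, 1]. A minimal sketch of the two metrics under that assumption (the abstract does not state the exact normalization or data range used):

```python
import numpy as np

def rmse(pred, ref):
    """Root-mean-square error between synthesized and reference images."""
    return float(np.sqrt(np.mean((pred - ref) ** 2)))

def psnr(pred, ref, data_range=1.0):
    """Peak signal-to-noise ratio in dB; data_range=1.0 assumes
    intensities normalized to [0, 1]."""
    return float(20.0 * np.log10(data_range / rmse(pred, ref)))
```

With data_range = 1, PSNR and RMSE are directly linked: PSNR = -20 log10(RMSE), so an RMSE of 0.0059 corresponds to roughly 44.6 dB, matching the reported figures.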