Fusing hyperspectral remote sensing images with multispectral remote sensing images is an effective way to enhance their spatial resolution. However, most existing deep fusion techniques adopt discretized explicit models to approximate the complex continuous nonlinear mapping of the fusion process, which limits the fidelity of reconstructed spatial details. Additionally, existing algorithms commonly rely on discrete methods such as bilinear or bicubic interpolation when upsampling the hyperspectral image, discarding crucial spatial-spectral features. To this end, this study proposes a novel Implicit Transformer Fusion Generative Adversarial Network (ITF-GAN), which combines the continuous representation capability of implicit neural representations with the self-attention mechanism of the Transformer architecture, using point-to-point implicit functions to efficiently process information in both the spatial and spectral dimensions. In addition, a guided implicit neural sampling module is introduced into the hyperspectral upsampling process to strengthen the coordinated expression of features across the spatial and spectral domains, improving both the spatial resolution and the spectral fidelity of the fused image. Fusion experiments at 4×, 8×, and 16× scale factors show that ITF-GAN offers significant advantages over current popular fusion algorithms in both objective evaluation metrics and subjective visual quality.
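
The abstract describes the point-to-point implicit-function idea only at a high level; as a rough illustration of what such coordinate-conditioned continuous upsampling can look like, the sketch below queries a small MLP at arbitrary sub-pixel locations of a latent feature map. It is a minimal PyTorch assumption, not the authors' ITF-GAN implementation: the module name ImplicitSpectralDecoder, the bilinear feature lookup, and all shapes are hypothetical.

# Hypothetical sketch of point-wise implicit upsampling; NOT the authors'
# ITF-GAN code. Names, shapes, and the bilinear lookup are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitSpectralDecoder(nn.Module):
    """MLP f(z, xy) -> spectrum: maps a latent feature z sampled at a
    continuous coordinate (x, y) to a C-band spectral vector."""
    def __init__(self, feat_dim: int, n_bands: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, n_bands),
        )

    def forward(self, feat: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # feat:   (B, D, H, W) latent feature map from an encoder
        # coords: (B, N, 2) query coordinates in [-1, 1], ordered (x, y)
        # grid_sample expects (B, H_out, W_out, 2); treat N queries as a 1xN grid
        grid = coords.view(coords.size(0), 1, -1, 2)
        z = F.grid_sample(feat, grid, mode='bilinear', align_corners=False)
        z = z.squeeze(2).permute(0, 2, 1)                # (B, N, D)
        return self.mlp(torch.cat([z, coords], dim=-1))  # (B, N, n_bands)

# Usage: upsample a low-resolution feature map to an arbitrary scale
# by querying the implicit function on a dense coordinate grid.
if __name__ == "__main__":
    B, D, H, W, C, scale = 1, 64, 16, 16, 31, 4
    decoder = ImplicitSpectralDecoder(feat_dim=D, n_bands=C)
    feat = torch.randn(B, D, H, W)
    ys = torch.linspace(-1, 1, H * scale)
    xs = torch.linspace(-1, 1, W * scale)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    coords = torch.stack([gx, gy], dim=-1).view(1, -1, 2)  # (1, N, 2)
    hr = decoder(feat, coords)                             # (1, N, C)
    hr = hr.view(B, H * scale, W * scale, C).permute(0, 3, 1, 2)
    print(hr.shape)  # torch.Size([1, 31, 64, 64])

Because the decoder is evaluated on a coordinate grid rather than at a fixed stride, the same model can be queried at any scale factor (e.g., 4×, 8×, or 16×) without retraining, which is the continuity property the abstract attributes to implicit neural representations.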