Spatial resolution enhancement of remote sensing data aims to increase the level of detail and accuracy of images captured by satellite sensors. We propose a novel spatial resolution enhancement framework based on a convolutional attention-based token mixer. The approach leverages spatial context and semantic information to improve the spatial resolution of images. It uses multi-head convolutional attention blocks and sub-pixel convolution to extract spatial and spectral information, and fuses the extracted features with the same components. The multi-head convolutional attention block effectively exploits local information in both the spatial and spectral dimensions. The method is tested on two types of data: a visual-thermal dataset and a visual-hyperspectral dataset. It is compared with state-of-the-art methods, including both traditional and deep learning methods. The experimental results show that the method is effective and outperforms the state-of-the-art methods in overall, spatial, and spectral accuracy.
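For readers unfamiliar with sub-pixel convolution, the following is a minimal sketch of its rearrangement step only (often called pixel shuffle), not the framework described here. It assumes the common channel ordering in which each group of r*r low-resolution channels fills one r-by-r block of output pixels; the abstract does not state which ordering is used. The function name `pixel_shuffle` and the toy input are hypothetical, and the convolution that would normally produce the input channels is omitted.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r].

    Assumed convention: input channel c*r*r + i*r + j supplies the pixel
    at row offset i and column offset j inside each r-by-r output block.
    """
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    h, w = len(x[0]), len(x[0][0])

    # Allocate the upscaled output, filled with zeros.
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]

    for c in range(c_out):
        for i in range(r):          # row offset inside the r-by-r block
            for j in range(r):      # column offset inside the r-by-r block
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


# Toy input: four low-resolution channels of size 1x2, upscale factor 2.
low_res = [[[1, 2]], [[3, 4]], [[5, 6]], [[7, 8]]]
high_res = pixel_shuffle(low_res, 2)  # one channel of size 2x4
```

In a full sub-pixel convolution layer, a learned convolution would first produce the C*r*r feature channels, and this rearrangement would then trade channel depth for spatial resolution.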