For visual relocalization from a single RGB image, scene coordinate regression (SCoRe) based on convolutional neural networks (CNNs) has become prevalent; however, the fixed geometric structure of a CNN makes it insufficient for extracting features that remain invariant under viewpoint changes. In this letter, we propose a global context-guided spatial feature transformation (SFT) network that learns viewpoint-invariant feature representations for robustness against viewpoint changes. Specifically, a global feature extracted from the source feature map is regarded as a dynamic convolutional kernel and is convolved with the source feature map to predict transformation parameters. The predicted parameters transform features from multiple viewpoints into a canonical space under the constraint of a maximum-likelihood-derived loss, thereby achieving viewpoint invariance. CoordConv is also employed to further improve feature discrimination in texture-less or repetitive regions. The proposed SFT network can be easily incorporated into a general SCoRe network. To the best of our knowledge, this is the first work to explicitly decouple features from viewpoints in a SCoRe network via a spatial feature transformation network, which yields stable and accurate visual relocalization. Experimental results demonstrate the effectiveness of the proposed method in terms of both accuracy and efficiency.
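For illustration only, the following minimal PyTorch sketch shows one plausible reading of the described pipeline, not the authors' implementation: the global feature (obtained here by global average pooling, an assumption) acts as a per-channel dynamic 1x1 kernel convolved with the source feature map, the result predicts per-pixel affine transformation parameters (scale and shift, a common SFT formulation), and CoordConv appends normalized coordinate channels. All module and function names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalContextSFT(nn.Module):
    """Sketch of a global context-guided spatial feature transform block (assumed design)."""

    def __init__(self, channels):
        super().__init__()
        # Heads mapping the context-enhanced map to per-pixel affine parameters (assumption).
        self.gamma_head = nn.Conv2d(channels, channels, kernel_size=1)
        self.beta_head = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        # Global feature via global average pooling, used as a per-channel dynamic 1x1 kernel.
        kernel = F.adaptive_avg_pool2d(x, 1).view(b * c, 1, 1, 1)
        # Per-sample dynamic convolution with the grouped-convolution trick.
        ctx = F.conv2d(x.view(1, b * c, h, w), kernel, groups=b * c).view(b, c, h, w)
        # Predict transformation parameters and modulate the source features.
        gamma = self.gamma_head(ctx)
        beta = self.beta_head(ctx)
        return gamma * x + beta


def add_coord_channels(x):
    """CoordConv: append normalized pixel-coordinate channels to a feature map."""
    b, _, h, w = x.shape
    ys = torch.linspace(-1.0, 1.0, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
    xs = torch.linspace(-1.0, 1.0, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
    return torch.cat([x, ys, xs], dim=1)
```

In this reading, the transformed (modulated) features from different viewpoints would then be driven toward a canonical space by the maximum-likelihood-derived loss described above; the loss itself is not sketched here.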