In this paper, we propose a pairwise spatial transformer network (PSTN) for cross-view gait recognition, which reduces unwanted feature misalignment caused by view differences before the recognition step, thereby improving performance. The proposed PSTN is a unified CNN architecture consisting of a pairwise spatial transformer (PST) and a subsequent recognition network (RN). More specifically, given a matching pair of gait features from different source and target views, the PST estimates a non-rigid deformation field and registers both features in the pair into their intermediate view, which mitigates registration-induced distortion compared with deforming the source view directly into the target view. The registered matching pair is then fed into the RN, which outputs a dissimilarity score. Although registration may reduce not only intra-subject variations but also inter-subject variations, we can still achieve a good trade-off between them by using a loss function designed to optimize recognition accuracy. Experiments on three publicly available gait datasets demonstrate that combining the PST with any of the benchmark gait recognition networks yields superior performance in both verification and identification scenarios.
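The registration-to-an-intermediate-view idea described above can be illustrated with a minimal sketch. The code below is an assumption-laden toy, not the authors' implementation: the deformation field is taken as given (in the PSTN it is estimated by the PST network), the features are single-channel 2-D maps, and the RN is replaced by a simple mean-squared-difference score. Warping each feature by half the source-to-target field, in opposite directions, makes the pair meet in the middle, so each feature undergoes a smaller (and hence less distorting) deformation than a full source-to-target warp.

```python
import numpy as np

def warp(feature, field):
    """Bilinearly sample a (H, W) feature map at locations displaced by
    a (H, W, 2) deformation field (dy in channel 0, dx in channel 1)."""
    H, W = feature.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Displaced sampling coordinates, clamped to the feature boundary.
    sy = np.clip(ys + field[..., 0], 0, H - 1)
    sx = np.clip(xs + field[..., 1], 0, W - 1)
    y0, x0 = np.floor(sy).astype(int), np.floor(sx).astype(int)
    y1, x1 = np.minimum(y0 + 1, H - 1), np.minimum(x0 + 1, W - 1)
    wy, wx = sy - y0, sx - x0
    # Standard bilinear interpolation of the four neighbouring samples.
    top = feature[y0, x0] * (1 - wx) + feature[y0, x1] * wx
    bot = feature[y1, x0] * (1 - wx) + feature[y1, x1] * wx
    return top * (1 - wy) + bot * wy

def pst_dissimilarity(src, tgt, field):
    """Toy stand-in for PST + RN: register both features of a matching
    pair into their intermediate view, then score the registered pair.

    `field` is assumed to be the deformation taking the source view to
    the target view; each feature is warped by only half of it.
    """
    src_mid = warp(src, 0.5 * field)    # source moved halfway forward
    tgt_mid = warp(tgt, -0.5 * field)   # target moved halfway back
    # Mean-squared difference as a placeholder dissimilarity score;
    # in the PSTN this is produced by the recognition network.
    return float(np.mean((src_mid - tgt_mid) ** 2))
```

With a zero field and identical features, the score is exactly 0; a genuine cross-view pair would instead rely on the estimated field to cancel the view difference before scoring.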