Person re-identification has attracted increasing attention owing to its important applications in surveillance. However, one major obstacle to deploying re-identification approaches in practical Internet of Things systems is their weak generalization ability: although current methods achieve high performance under the supervised setting, their discrimination ability declines in unseen domains. Since pedestrians' attributes remain invariant across domains, we exploit the alignment between attributes and visual features to enhance the model's generalization ability. Furthermore, because existing methods cannot fully extract attribute information, we formulate a more effective NLP-based method for attribute feature extraction; the generated features are termed textual features, and the proposed method is called Visual Textual Alignment (VTA). For alignment, two strategies are adopted: metric-learning-based alignment, which adjusts the metric relationships among different persons in the feature space, and adversarial-learning-based alignment, which guides the model to learn domain-invariant features. Experimental results demonstrate the effectiveness and superiority of our proposed method compared with state-of-the-art methods.