Person re-identification (re-id) methods based on supervised learning require large numbers of manually labeled samples for training, which limits their scalability in practical re-id applications. Existing unsupervised video person re-id methods typically focus on extracting appearance features from pedestrian videos, ignoring motion information and the fact that people usually move in groups, i.e., pedestrian spatio-temporal co-occurrence patterns. The key to unsupervised video person re-id is to effectively exploit both spatio-temporal cues from video sequences and cross-camera tracklet association. In this work, we propose a progressive deep learning method for unsupervised person re-id via tracklet association with spatio-temporal correlation (TASTC). Specifically, we first uniformly divide each tracklet into multiple temporally localized slices according to a time pyramid structure. Then, an initial re-id model is trained on a two-stream convolutional architecture, which learns the accumulative motion context of the temporally localized slices of the tracklets within each camera. Finally, by combining accumulative motion with tracklet spatio-temporal correlation, we associate tracklets across cameras and update the re-id model. These steps are iterated to progressively optimize the re-id model. Experimental results on three video-based benchmark datasets, iLIDS-VID, MARS, and DukeMTMC-VideoReID, demonstrate that the proposed method significantly outperforms state-of-the-art unsupervised video person re-id methods.
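The time-pyramid slicing step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the pyramid depth `num_levels`, and the choice of splitting level *l* into 2^l near-equal contiguous segments are all assumptions for demonstration only.

```python
def time_pyramid_slices(tracklet, num_levels=3):
    """Divide a tracklet into temporally localized slices via a time pyramid.

    `tracklet` is any sequence of frames. At pyramid level l, the tracklet is
    split into 2**l near-equal contiguous segments (an assumed scheme; the
    abstract does not specify the exact pyramid layout). Returns a list of
    (level, frame_slice) pairs.
    """
    slices = []
    n = len(tracklet)
    for level in range(num_levels):
        parts = 2 ** level
        # Boundaries of `parts` near-equal contiguous segments.
        bounds = [round(i * n / parts) for i in range(parts + 1)]
        for i in range(parts):
            slices.append((level, tracklet[bounds[i]:bounds[i + 1]]))
    return slices

frames = list(range(8))  # toy tracklet of 8 frame indices
pyramid = time_pyramid_slices(frames, num_levels=3)
```

With 3 levels the toy tracklet yields 1 + 2 + 4 = 7 slices: the whole sequence, its two halves, and its four quarters, giving the model motion context at multiple temporal scales.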