Transformer has shown its great strength in visual object tracking due to its effective attention mechanism, but most prevailing transformer-based trackers only explore temporal information frame by frame, thus overlooking the rich context information inherent in videos. To alleviate this problem, we propose a transformer-based tracker via learning immediate appearance change information in videos, called IAC-tracker. The proposed tracker enhances the perception of the immediate motion state to improve the performance of single target tracking. IAC-tracker contains three key components: a spatial information extractor (SIE) with a superior attention mechanism to progressively extract spatial information, a temporal information extractor (TIE) with a designed temporal attention mechanism to progressively learn target immediate appearance change, and a novel spatial–temporal context enhanced fusion module integrating the information from SIE and TIE to prepare for the final prediction head. Comparison experiments with state-of-the-art trackers on six challenging datasets demonstrate the superior performance of IAC-tracker with real-time running speed.