Multi-object tracking (MOT) is central to human surveillance, sports analytics, autonomous driving, and cooperative robotics. Current MOT methods perform poorly under non-uniform motion, occlusion, and appearance–reappearance scenarios. We introduce an MOT method that unifies object detection and identity linkage in an end-to-end trainable framework designed to maintain object links over long periods. The proposed model, STMMOT, is built around four key modules: (1) a candidate proposal creation network, which generates object proposals via a vision-Transformer encoder–decoder architecture; (2) a scale-variant pyramid, a progressive pyramid structure that learns self-scale and cross-scale similarities in multi-scale feature maps; (3) a spatio-temporal memory encoder, which extracts the essential information from the memory associated with each tracked object; and (4) a spatio-temporal memory decoder, which jointly solves object detection and identity association for MOT. The system relies on a spatio-temporal memory module that retains extensive historical observations of object states and encodes them with an attention-based aggregator. STMMOT's distinguishing feature is that it represents objects as dynamic query embeddings that are updated continuously, which enables object states to be predicted with an attention mechanism and removes the need for post-processing. Experimental results show that STMMOT achieves 79.8 and 78.4 IDF1, 79.3 and 74.1 MOTA, 73.2 and 69.0 HOTA, and 61.2 and 61.5 AssA, with 1529 and 1264 ID switches, on MOT17 and MOT20, respectively.
Compared with the previous best method, TransMOT, STMMOT improves IDF1 by about 4.58% and 4.25% and reduces ID switches by 5.79% and 21.05% on MOT17 and MOT20, respectively.
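
To make the idea of a per-object memory read by an attention-based aggregator concrete, the following is a minimal sketch and not the STMMOT implementation. It assumes a fixed-length buffer of past state embeddings per track, single-head scaled dot-product attention without learned projections, and a simple convex blend to refresh the query embedding. The names `TrackMemory`, `update_query`, `max_len`, and `alpha` are hypothetical and introduced only for illustration.

```python
from collections import deque

import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


class TrackMemory:
    """Hypothetical per-object memory: a fixed-length buffer of past embeddings."""

    def __init__(self, dim, max_len=24):
        self.dim = dim
        # Assumption: the oldest observation is dropped once the buffer is full.
        self.buffer = deque(maxlen=max_len)

    def write(self, embedding):
        """Append the latest state embedding for this object."""
        embedding = np.asarray(embedding, dtype=np.float64)
        if embedding.shape != (self.dim,):
            raise ValueError("embedding must have shape (dim,)")
        self.buffer.append(embedding)

    def aggregate(self, query):
        """Attend over the stored embeddings with the current query.

        Returns the attention-weighted context vector and the weights.
        """
        if not self.buffer:
            return np.zeros(self.dim), np.zeros(0)
        mem = np.stack(list(self.buffer))           # shape (T, dim)
        scores = mem @ query / np.sqrt(self.dim)    # shape (T,)
        weights = softmax(scores)
        return weights @ mem, weights


def update_query(query, context, alpha=0.5):
    """Assumed update rule: blend the old query with the memory context."""
    return (1.0 - alpha) * query + alpha * context


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    memory = TrackMemory(dim=8, max_len=4)
    query = rng.normal(size=8)
    for _ in range(6):  # six frames; only the last four are retained
        memory.write(rng.normal(size=8))
        context, weights = memory.aggregate(query)
        query = update_query(query, context)
    print(len(memory.buffer), weights.shape, query.shape)
```

In a trained model the query, keys, and values would pass through learned projections and the update would itself be learned; the fixed blend here only illustrates how a continuously updated query embedding can summarize an object's history without a separate post-processing step.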