Real-time multi-object tracking (MOT) is the task of detecting multiple objects and tracking them simultaneously. Once the objects are detected, each is assigned an identity marker and its trajectory is followed in real time. MOT has attracted growing research interest for smart-city applications, chiefly intelligent transportation, vehicle and pedestrian detection, crowd surveillance, and public safety. In recent years, deep learning techniques have been developed to address the challenges of real-time MOT and to improve tracking performance. Environmental perception in smart traffic applications relies heavily on sensor data fusion: in traffic scenarios, a combination of sensors and cameras is used to detect and track targets while gathering useful data. However, MOT still struggles to detect and track objects that are in motion, that undergo complex changes in appearance, or that appear in crowded scenes. This paper reviews the foundations of real-time MOT. We emphasize quantitative evaluation through a comprehensive analysis of widely used benchmark datasets and metrics. We also examine established embedding techniques and multi-modal fusion methods used in real-time MOT algorithms, classifying and assessing each strategy according to a predefined set of principles. The paper presents a comprehensive analysis and visual comparison of the various MOT strategies. Finally, it gives an overview of the challenges currently facing MOT and of potential directions for future work.