In urban environments, road transportation generates substantial traffic. The resulting surge in vehicles creates complex problems, including hindered emergency vehicle movement due to high density and congestion, and a shortage of personnel amplifies these challenges. As traffic conditions worsen, the need for automated solutions to manage emergency situations becomes more evident. Intelligent traffic monitoring can identify and prioritize emergency vehicles, potentially saving lives. However, classifying emergency vehicles through visual analysis is difficult because of clutter, occlusions, and traffic variations. Vision-based vehicle detection techniques rely on clear rear views, which are hard to obtain in dense traffic. In contrast, audio-based methods are resilient to the Doppler effect from moving vehicles, but handling diverse background noises remains unexplored. Using acoustics for emergency vehicle localization also presents challenges related to sensor range and real-world noise. To address these issues, this study introduces a novel solution: combining visual and audio data for enhanced detection and localization of emergency vehicles in road networks. This multi-modal approach aims to improve accuracy and robustness in emergency vehicle management. The proposed methodology consists of several key steps. The presence of an emergency vehicle is first detected from visual images, which are preprocessed with an adaptive background model to remove clutter and occlusions. Subsequently, a cell-wise classification strategy using a customized Visual Geometry Group Network (VGGNet) deep learning model determines whether an emergency vehicle is present within each cell. To further reinforce the accuracy of emergency vehicle presence detection, the outcomes from the audio data analysis are integrated. 
This involves extracting spectral features from audio streams and classifying them with a support vector machine (SVM) model. The information derived from the visual and audio sources is fused to construct a more comprehensive and refined traffic state map, which facilitates the effective management of emergency vehicle transit. In empirical evaluations, the proposed solution mitigates challenges such as visual clutter, occlusions, and variations in traffic density, which are common issues in traditional visual analysis methods. Notably, the proposed approach achieves an accuracy of approximately 98.15% in the localization of emergency vehicles.
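
The fusion step described above can be illustrated with a minimal sketch. The abstract does not specify the fusion rule, so the weighted average used here, the function name `fuse_cell_scores`, and the parameters `audio_weight` and `threshold` are all assumptions made for illustration. The per-cell visual scores stand in for the output of the cell-wise VGGNet classifier, and the single audio score stands in for the output of the SVM.

```python
from typing import List


def fuse_cell_scores(
    visual_scores: List[List[float]],
    audio_score: float,
    audio_weight: float = 0.3,   # assumed weight, not taken from the paper
    threshold: float = 0.5,      # assumed decision threshold
) -> List[List[bool]]:
    """Combine per-cell visual scores with one scene-level audio score.

    Each fused score is a weighted average of the cell's visual score and
    the audio score; a cell is flagged when the fused score reaches the
    threshold. This is a hypothetical late-fusion rule, not the authors'.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")

    fused_map: List[List[bool]] = []
    for row in visual_scores:
        fused_row = []
        for visual in row:
            fused = (1.0 - audio_weight) * visual + audio_weight * audio_score
            fused_row.append(fused >= threshold)
        fused_map.append(fused_row)
    return fused_map


# Hypothetical 2x2 grid of visual scores for one frame.
grid = [[0.10, 0.60],
        [0.45, 0.90]]

siren_heard = fuse_cell_scores(grid, audio_score=0.9)
no_siren = fuse_cell_scores(grid, audio_score=0.0)
```

In this sketch, a strong audio score lifts borderline cells (such as a partly occluded vehicle scoring 0.45) over the threshold, while the absence of a siren suppresses all but the most confident visual detections. A scheme of this kind is one simple way the two modalities could reinforce each other in the traffic state map.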