Accurately identifying vehicles and estimating traffic density in surveillance systems is a challenging task, particularly in scenarios with closely spaced lanes. The Single Shot MultiBox Detector (SSD) has been applied to vehicle detection and classification because of its speed and accuracy; it employs transfer learning, which enables it to reuse features from pretrained Convolutional Neural Networks (CNNs). Although SSD's multi-scale feature maps have yielded efficient results in vehicle detection, its one-stage detection approach struggles to capture key vehicle features. Furthermore, the use of VGG16 as the backbone network in these approaches discards fine-grained details, making it difficult to accurately localize and classify small vehicles. In addition, the absence of interaction between high- and low-level features restricts the network's ability to effectively integrate both for accurate vehicle detection. To overcome these limitations, this paper introduces an improved SSD that leverages interactive multi-scale attention features to accurately detect and classify vehicles. The approach adopts ResNet50 as the network backbone to address the traditional SSD's weakness in detecting small vehicles, and incorporates an attention block that focuses on key details by assigning higher weight to relevant pixels within the feature map. The network also employs a parallel detection framework that shares both high- and low-level multi-scale layers, enabling efficient detection of vehicles of various sizes. Traffic density is then estimated from the weights assigned to the categorized vehicles produced by the improved SSD and the monitored road area. Finally, traffic is classified as high, moderate, or low by comparing the estimated density against threshold values.
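The density-estimation and threshold-classification step described above can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: the class weights, road area, and threshold values are hypothetical placeholders.

```python
# Illustrative sketch (hypothetical values): estimate traffic density from
# weighted vehicle counts and classify it against thresholds.

def estimate_density(detections, road_area_m2, class_weights):
    """Sum per-class weights over detected vehicles, normalized by road area."""
    weighted_count = sum(class_weights[label] for label in detections)
    return weighted_count / road_area_m2

def classify_traffic(density, low_thresh, high_thresh):
    """Map a density value onto low / moderate / high traffic levels."""
    if density < low_thresh:
        return "low"
    if density < high_thresh:
        return "moderate"
    return "high"

# Hypothetical example: larger vehicle classes contribute more to density.
weights = {"car": 1.0, "bus": 2.5, "truck": 2.0, "motorcycle": 0.5}
detections = ["car", "car", "bus", "truck", "motorcycle"]
density = estimate_density(detections, road_area_m2=500.0, class_weights=weights)
level = classify_traffic(density, low_thresh=0.005, high_thresh=0.02)
```

In this sketch the weighted count is 7.0 over a 500 m² area, giving a density of 0.014, which falls between the two hypothetical thresholds and is therefore labeled moderate.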
By adding attention features at multiple scales to the original detection branch and replacing SSD's VGG16 backbone with ResNet50, our method significantly improves both feature representation capability and detection accuracy.
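To make the attention mechanism concrete, the following is a minimal NumPy sketch of a squeeze-and-excitation-style channel attention block, the general kind of mechanism used to assign higher weight to relevant parts of a feature map. The layer shapes and random weights are hypothetical; this is not the paper's exact attention block.

```python
import numpy as np

# Hypothetical sketch of a channel attention block (SE-style):
# global-average-pool -> FC -> ReLU -> FC -> sigmoid -> channel rescaling.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feature_map, w1, w2):
    """feature_map: (C, H, W); w1, w2: FC weights. Returns reweighted map."""
    squeezed = feature_map.mean(axis=(1, 2))                 # squeeze -> (C,)
    excited = sigmoid(w2 @ np.maximum(w1 @ squeezed, 0.0))   # excitation, in (0, 1)
    return feature_map * excited[:, None, None]              # rescale channels

rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 4, 4))       # toy feature map, 8 channels
w1 = rng.standard_normal((2, 8)) * 0.1      # reduction to 2 hidden units
w2 = rng.standard_normal((8, 2)) * 0.1
out = channel_attention(fmap, w1, w2)
```

Because the sigmoid gate lies in (0, 1), each channel of the output is a damped copy of the input channel; informative channels receive gates closer to 1 and thus survive the rescaling largely intact.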