The paper presents a novel deep learning approach to crowd counting for intelligent video surveillance systems, addressing the growing need to monitor public spaces accurately in urban environments. Demand for precise crowd estimation stems from security, public safety, and efficiency concerns in urban areas, particularly during large public events. Existing crowd counting techniques, including feature-based object detection and regression-based methods, perform poorly in high-density scenes because of occlusion, lighting variation, and the varied appearance of people. To overcome these limitations, the authors propose a deep encoder-decoder architecture based on VGG16 that combines hierarchical feature extraction with spatial and channel attention mechanisms. The architecture improves the model’s handling of variations in crowd density by using adaptive pooling and dilated convolutions to extract meaningful features from dense crowds. The decoder is further refined to produce separate density maps for sparse and crowded scenes, which improves adaptability and accuracy. Evaluations on benchmark datasets, including ShanghaiTech and UCF_CC_50, show that the model outperforms state-of-the-art methods, with significant improvements in mean absolute error and mean squared error. The paper emphasizes the importance of addressing environmental variability and scale differences in crowded scenes and shows that the proposed model is effective in both sparse and dense crowd conditions. This research advances intelligent video surveillance by providing a more accurate and efficient method for crowd counting, with potential applications in public safety, transportation management, and urban planning.
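To make the reported metrics concrete, the following is a minimal sketch of how a count is typically read off a density map and how mean absolute error and mean squared error are computed over a test set. It is not the authors' evaluation code. The function names and the toy density maps are assumptions made for illustration, and the sketch assumes the usual convention that a density map integrates (sums) to the number of people in the image. Many crowd counting papers report the square root of the mean squared error under the name MSE; the summary above does not say which convention the paper uses, so both are shown.

```python
import math
from typing import Sequence

# A density map is represented here as a 2D grid of non-negative floats
# (hypothetical representation; real pipelines usually use arrays or tensors).
DensityMap = Sequence[Sequence[float]]


def count_from_density(density_map: DensityMap) -> float:
    """Estimate the crowd count as the sum of all density values."""
    return sum(sum(row) for row in density_map)


def mean_absolute_error(pred: Sequence[float], true: Sequence[float]) -> float:
    """Average absolute difference between predicted and true counts."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def mean_squared_error(pred: Sequence[float], true: Sequence[float]) -> float:
    """Average squared difference between predicted and true counts."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def root_mean_squared_error(pred: Sequence[float], true: Sequence[float]) -> float:
    """Square root of the MSE (the variant many crowd counting papers report)."""
    return math.sqrt(mean_squared_error(pred, true))


# Toy example: two tiny predicted density maps and their ground-truth counts.
predicted_maps = [
    [[0.5, 0.5], [1.0, 0.0]],  # sums to 2.0
    [[2.0, 1.0], [1.5, 0.5]],  # sums to 5.0
]
true_counts = [3.0, 5.0]

predicted_counts = [count_from_density(m) for m in predicted_maps]
mae = mean_absolute_error(predicted_counts, true_counts)
mse = mean_squared_error(predicted_counts, true_counts)
rmse = root_mean_squared_error(predicted_counts, true_counts)
```

Because both metrics are computed on per-image counts rather than per-pixel values, a model can score well on them while still misplacing density within an image. MAE is usually read as a measure of accuracy, and the squared-error metric as a measure of robustness to large per-image errors.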