Estimating and predicting crowd density at large events or during disasters is of paramount importance for emergency management, including planning effective evacuation routes, optimizing rescue operations, and deploying emergency services efficiently. Traditionally, camera-based surveillance systems have been employed to monitor crowd movements, but estimating crowd density accurately with such systems is challenging: dense crowds produce heavy occlusion, two-dimensional cameras cannot capture the full geometry of three-dimensional scenes, and optical distortion, environmental conditions, and variations in camera angle further degrade accuracy. To address these challenges, this paper introduces a robust crowd density estimation method built on vision transformers. By combining the transformer output with a two-stage neural network, the method mitigates the limitations of traditional approaches. A key advantage of the proposed system is its robustness across camera specifications, installation locations, and image aspect ratios. We apply and evaluate several deep learning techniques and introduce modifications to existing network architectures that better suit the problem. Extensive experiments demonstrate that the proposed method produces consistently accurate crowd density estimates even in diverse and complex crowd environments, underscoring its potential for improving emergency management and crowd control in real-world situations.
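To make the described pipeline concrete, the sketch below illustrates the general shape of such a system: a vision-transformer backbone whose patch features feed a two-stage head that first regresses a coarse density map and then refines it, with the crowd count obtained by integrating the density map. This is a minimal illustration assuming a PyTorch implementation; all module names, layer sizes, and the refinement design are assumptions for exposition, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class ViTBackbone(nn.Module):
    """Minimal ViT-style encoder: patch embedding + transformer layers (assumed design)."""
    def __init__(self, img_size=224, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        self.grid = img_size // patch                      # 14x14 patch grid
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        tokens = self.embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        feats = self.encoder(tokens + self.pos)
        b, n, d = feats.shape
        # Fold the token sequence back into a 2D feature map for the head.
        return feats.transpose(1, 2).reshape(b, d, self.grid, self.grid)

class TwoStageDensityHead(nn.Module):
    """Stage 1 regresses a coarse density map; stage 2 refines it (assumed design)."""
    def __init__(self, dim=256):
        super().__init__()
        self.coarse = nn.Sequential(
            nn.Conv2d(dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1), nn.ReLU())
        # The refinement stage sees both the backbone features and the coarse estimate.
        self.refine = nn.Sequential(
            nn.Conv2d(dim + 1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1), nn.ReLU())

    def forward(self, feats):
        coarse = self.coarse(feats)
        fine = self.refine(torch.cat([feats, coarse], dim=1))
        return coarse, fine

backbone, head = ViTBackbone(), TwoStageDensityHead()
img = torch.randn(1, 3, 224, 224)                          # dummy input frame
coarse, fine = head(backbone(img))
count = fine.sum().item()  # estimated headcount = integral of the density map
```

In density-map-based crowd counting, each annotated head is typically blurred into a small Gaussian so that the map's integral equals the ground-truth count; the coarse and refined maps above would each be supervised against such a target, which is one common way a two-stage head is trained.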