Crowding in a confined space can lead to air stagnation in waiting areas. Running air conditioning continuously throughout the day to maintain air circulation may result in excessive energy consumption by the building. To address this issue, Heating, Ventilation, and Air Conditioning (HVAC) systems are employed to manage and regulate indoor energy usage. However, sensor-based detection often fails to capture changes in human occupancy promptly, resulting in less accurate density readings. Camera footage is more reliable than sensors for detecting crowds. This research uses You Only Look Once version 8 (YOLOv8), a robust object detection algorithm that is particularly effective for crowd detection in images, together with the Convolutional Vision Transformer (CvT), which classifies crowd density into two levels, "Normal" and "Crowded". CvT improves classification accuracy by incorporating properties of Convolutional Neural Networks (CNNs) into model training, such as local receptive fields and shared weights. By integrating YOLOv8 and CvT, the method first identifies human presence in the indoor waiting area and then classifies the crowd density level. Evaluation metrics are mean Average Precision (mAP) for YOLOv8, and accuracy, precision, recall, and F1-score for CvT. The resulting density classification directly informs the management of HVAC systems.
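As an illustration of the classification metrics named above, the following minimal sketch computes accuracy, precision, recall, and F1-score for the two density levels. It is not the evaluation code used in this research: the function name `binary_metrics`, the choice of "Crowded" as the positive class, and the sample labels are assumptions made for the example.

```python
def binary_metrics(y_true, y_pred, positive="Crowded"):
    """Compute accuracy, precision, recall, and F1 for a two-class problem.

    Assumption: "Crowded" is treated as the positive class; the source
    does not state which level is considered positive.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels for five waiting-area images (illustrative only).
true_levels = ["Crowded", "Crowded", "Normal", "Normal", "Crowded"]
pred_levels = ["Crowded", "Normal", "Normal", "Crowded", "Crowded"]
metrics = binary_metrics(true_levels, pred_levels)
```

In this made-up example there are two true positives, one false positive, one false negative, and one true negative, so accuracy is 3/5 and precision, recall, and F1-score are each 2/3.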