With global warming intensifying and resource conflicts escalating, the world is undergoing a transformative shift toward sustainable practices and energy-efficient solutions. With more than 32% of global energy used by commercial and residential buildings, there is an urgent need to revisit traditional approaches to Building Energy Management (BEM). Within a BEM System (BEMS) platform, regulating the operation of Heating, Ventilation, and Air Conditioning (HVAC) systems is particularly important, given that HVAC systems account for about 40% of the total energy cost in the commercial sector. This paper offers a Deep Reinforcement Learning (DRL) algorithm as a data-driven approach to controlling HVAC operation, with the goal of enhancing the energy efficiency of commercial buildings with open offices while ensuring thermal comfort for occupants in different zones. Compared to alternative methods such as rule-based models and model-predictive control, data-driven models have shown promising results in optimizing building energy consumption without the need for building-specific thresholds, prior knowledge about the underlying physics of heat distribution, or digital mapping of the airflow. Despite the impressive performance of modern DRL methods in energy management, a dedicated energy-saving solution for open-plan offices with multiple Variable Air Volume (VAV) systems, where different zones cannot be treated independently, is still missing. In addition, some existing methods suffer from long training times and a lack of generalizability, which stem from using over-complicated models, incorporating external factors that are hard to model and characterize, and including factors that are not typically accessible. To solve these issues, we propose a low-complexity DRL-based model with a multi-input multi-output architecture for the HVAC energy optimization of open-plan offices, which uses only a handful of controllable and accessible factors.
The efficacy of our solution is evaluated through extensive analysis of the overall energy consumption and thermal comfort levels, compared to a baseline system based on the existing HVAC schedule from a real case. This comparison shows that our method achieves 37% savings in energy consumption with minimal violation (<1%) of the desired temperature range during work hours. Owing to its low-complexity architecture, training a network that performs well across diverse conditions takes only about 40 min in total for 5 epochs (about 7.75 min per epoch); the model can therefore be easily adapted to changes in the building setup, weather conditions, occupancy rate, etc. Moreover, by enforcing smoothness on the control strategy, we suppress frequent and unpleasant on/off transitions of HVAC units to avoid occupant discomfort and potential damage to the system. The generalizability of our model is verified by applying it to different building models and under various weather conditions.
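
To make the trade-off described above concrete, the following is a minimal sketch of how a per-step reward might combine energy use, comfort violation across zones, and a smoothness penalty on on/off transitions. It is an illustration only and not the reward used in this work: the function name `step_reward`, the weights, the 21-24 °C comfort range, and the binary on/off action encoding are all assumptions introduced here.

```python
from typing import Sequence, Tuple


def step_reward(
    energy_kwh: float,
    zone_temps_c: Sequence[float],
    actions: Sequence[int],
    prev_actions: Sequence[int],
    comfort_range_c: Tuple[float, float] = (21.0, 24.0),  # assumed range
    w_energy: float = 1.0,   # assumed weight
    w_comfort: float = 1.0,  # assumed weight
    w_smooth: float = 0.1,   # assumed weight
) -> float:
    """Hypothetical reward: penalize energy, discomfort, and switching."""
    low, high = comfort_range_c
    # Degrees outside the comfort range, summed over all zones.
    comfort_violation = sum(
        max(low - t, 0.0) + max(t - high, 0.0) for t in zone_temps_c
    )
    # Number of units whose on/off state changed since the previous step.
    switching = sum(abs(a - p) for a, p in zip(actions, prev_actions))
    return -(
        w_energy * energy_kwh
        + w_comfort * comfort_violation
        + w_smooth * switching
    )


if __name__ == "__main__":
    # Two zones, two units (1 = on, 0 = off); the second unit is switched off.
    print(step_reward(2.0, [22.0, 25.0], [1, 0], [1, 1]))
```

Because the switching term sees the actions of all units at once, a single multi-output policy is penalized for toggling any unit, which is one simple way to discourage frequent on/off transitions in coupled zones.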