The increasing use of remote and mobile access, integrated wearable technologies, data exchange, and cloud-based data analytics in modern smart buildings is steering the building industry towards open communication technologies. This increased connectivity and accessibility could expose smart buildings to more cyber-attacks. At the same time, physical faults (e.g., faults in heating, ventilation, and air-conditioning (HVAC) systems) can have adverse impacts on building energy systems similar to those of cyber-attacks, such as occupant discomfort, energy wastage, and equipment downtime. However, current anomaly detection methods based on physical behavior fail to differentiate between cyber-attacks and physical faults in building energy systems. Moreover, the difficulty of collecting real-world threat data with ground truth has led researchers to rely on numerical models with user-defined assumptions, which, in the absence of in-situ experimental datasets, may not accurately reflect real-world conditions. To address these gaps, this paper presents a flexible hardware-in-the-loop (HIL) testbed for generating cyber-attack and physical fault datasets and for demonstrating threat detection algorithms in a real building automation system (BAS) environment. The testbed combines hardware (i.e., a real BAS with local HVAC controllers and a physical network) with software (i.e., high-fidelity models that represent the behavior of the building envelope and HVAC energy systems), enabling the emulation of realistic threats. Five HIL experiments (one baseline without any threats, two with physical faults, and two with cyber-attacks) were conducted to generate datasets containing detailed network traffic and system states. A joint classification framework, incorporating a network analyzer and a physical HVAC fault detector, is proposed to automatically detect cyber-physical abnormalities in the BAS at both the network and the physical HVAC levels.
The network analyzer comprises a conditional random fields (CRF) based command validator and a statistics-based detection strategy. The fault detector employs a weather- and schedule-based pattern matching and feature-based principal component analysis (WPM-FPCA) method. Evaluating the classification with four metrics derived from the multi-class confusion matrix yielded an average accuracy of 90.2 %, recall of 89.7 %, precision of 88.5 %, and F1-score of 89.2 %. These results demonstrate that the proposed joint classification framework can effectively differentiate between specific types of cyber-attacks (e.g., device reinitialization attack, network Denial-of-Service attack) and physical faults (e.g., air handling unit operational fault, stuck cooling coil valve) in real time for improved building energy management.
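As a minimal sketch of how metrics of this kind can be derived from a multi-class confusion matrix, the Python snippet below computes overall accuracy together with macro-averaged recall, precision, and F1-score. The function name `macro_metrics`, the three class labels, and the matrix values are hypothetical and chosen only for illustration; they are not the experimental data, and macro averaging is an assumption rather than necessarily the averaging scheme used in this work.

```python
from typing import Dict, List


def macro_metrics(cm: List[List[int]]) -> Dict[str, float]:
    """Compute accuracy and macro-averaged metrics.

    cm[i][j] is the number of samples whose true class is i and
    whose predicted class is j (rows = truth, columns = prediction).
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n))

    recalls, precisions, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        actual = sum(cm[k])                          # row sum: true members of class k
        predicted = sum(cm[i][k] for i in range(n))  # column sum: predicted as class k
        recall = tp / actual if actual else 0.0
        precision = tp / predicted if predicted else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        recalls.append(recall)
        precisions.append(precision)
        f1s.append(f1)

    return {
        "accuracy": correct / total if total else 0.0,
        "recall": sum(recalls) / n,
        "precision": sum(precisions) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical 3-class example (illustrative values only):
# rows/columns = [normal, physical fault, cyber-attack]
example_cm = [
    [8, 1, 1],
    [1, 9, 0],
    [0, 2, 8],
]

if __name__ == "__main__":
    for name, value in macro_metrics(example_cm).items():
        print(f"{name}: {value:.3f}")
```

Macro averaging weights every class equally, which is one reasonable choice when threat classes are rarer than normal operation; a support-weighted average would instead emphasize the most frequent classes.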