In the design of new buildings or buildings undergoing major renovation, building performance simulation (BPS) tools are a key instrument for sizing the envelope or selecting the best solution to integrate. Many BPS tools are now available to researchers and designers; each was validated independently under different operating conditions, and they have rarely been compared directly under the same conditions. The objective of this work is to evaluate the prediction accuracy of the most popular BPS tools, namely TRNSYS, EnergyPlus and IDA ICE, by comparing simulated results with experimental measurements taken under real operating conditions. To this end, two small-scale solar test boxes (STBs), each with one glazed wall exposed to the outdoor environment of Rome, were used for the experimental investigation. The envelope of the reference STB is insulated and made of conventional materials. In the second STB, the floor is additionally equipped with a commercial phase change material (PCM) panel. Both STBs were equipped with a data acquisition system to measure the internal air temperature, the external and internal glass surface temperatures and, for the PCM-based STB, the internal surface temperature of the PCM floor. A detailed description and comparison of the mathematical models used by the three BPS tools is provided, followed by an alignment of the geometry, weather data, technical parameters and heat transfer parameters so that all the tools operate under the same conditions. Three experimental campaign periods were considered and used to evaluate the accuracy of each BPS tool. Common accuracy indices were used for the comparison, namely R², RMSE and normalized RMSE, together with an overall accuracy index that summarizes them across the experimental campaign periods.
The results highlight the most accurate mathematical models for predicting the dynamic thermal behaviour of the STB in the absence and presence of a PCM. Without PCM in the STB, the three tools are comparable, providing a high overall accuracy index in all periods, with a ranking that varies from period to period owing to their different treatment of solar radiation modelling. With PCM in the STB, IDA ICE achieves the highest overall accuracy index in all periods. Unlike IDA ICE, TRNSYS and EnergyPlus do not account for PCM hysteresis. The TRNSYS model is the least accurate, since it neglects both hysteresis and the phase change temperature range, which is implemented in both IDA ICE and EnergyPlus. However, TRNSYS predictions can be considered acceptable for a preliminary evaluation, since they require little input data and very low computational cost.
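As an illustration of the accuracy indices used in the comparison (the coefficient of determination, RMSE and normalized RMSE), the following minimal Python sketch computes them for a measured and a simulated temperature series. It is not the authors' implementation. Two choices are assumptions: the coefficient of determination is taken as 1 - SS_res/SS_tot (a squared Pearson correlation is another common definition), and the RMSE is normalized by the range of the measured values (normalization by the mean is also common). The function names and the sample temperatures are hypothetical, and the overall accuracy index is not reproduced because its definition is not given here.

```python
import math


def rmse(measured, simulated):
    """Root mean square error between measured and simulated values."""
    n = len(measured)
    return math.sqrt(sum((m - s) ** 2 for m, s in zip(measured, simulated)) / n)


def nrmse(measured, simulated):
    """RMSE normalized by the range of the measured values (assumed convention)."""
    return rmse(measured, simulated) / (max(measured) - min(measured))


def r_squared(measured, simulated):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_measured = sum(measured) / len(measured)
    ss_res = sum((m - s) ** 2 for m, s in zip(measured, simulated))
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


# Illustrative values only: internal air temperatures in degrees Celsius.
measured_air_temp = [20.0, 22.0, 24.0, 26.0]
simulated_air_temp = [21.0, 22.0, 23.0, 26.0]

print("RMSE :", rmse(measured_air_temp, simulated_air_temp))
print("NRMSE:", nrmse(measured_air_temp, simulated_air_temp))
print("R2   :", r_squared(measured_air_temp, simulated_air_temp))
```

In a tool comparison of this kind, the same functions would be applied to each monitored quantity (air, glass surface and PCM floor surface temperatures) for each tool and each campaign period.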