In this study, the performance of 40 models from the Coupled Model Intercomparison Project Phase 6 (CMIP6) is evaluated against observational data from synoptic stations in Iran using several evaluation criteria. The results show that model accuracy varies across climate conditions and criteria, with particularly large disparities for the nonstationarity R criterion. Nevertheless, in the rankings of both the raw and bias-corrected outputs of the CMIP6 GCMs for Iran, NorESM2-MM, AWI-ESM-1-1-LR, and MPI-ESM1-2-LR are consistently among the top six models for precipitation. For temperature, MPI-ESM1-2-LR, TaiESM1, INM-CM4-8, and IITM-ESM are consistently among the top six models in both the raw and bias-corrected rankings. Two bias correction methods, quantile mapping and linear scaling, were applied in combination with Bayesian model averaging. Although quantile mapping performs better and shows less disparity than linear scaling, it is ineffective at stations where the bias is nonstationary over time. The RMSE for monthly precipitation ranges from almost 0 to 200 mm, with the largest values occurring at high-precipitation stations, while the RMSE for monthly temperature ranges from 0 to 4 °C. Using a multi-model ensemble improves accuracy relative to individual models: the difference between the minimum and maximum RMSE values falls from 178.6 to 91.0, the range of the mean absolute error decreases from 126.9 to 93.3, and the range of the correlation coefficient narrows from 0.9 to 0.42. Averaging the models after bias correction avoids large fluctuations and retains higher accuracy than the alternative approach of bias-correcting the ensemble after averaging.
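The two bias correction methods and the two orderings of correction and averaging can be illustrated with a minimal sketch. It is not the study's implementation: the function names, the toy monthly values, the use of the same period for calibration and application, and the equal-weight ensemble mean (used here in place of Bayesian model averaging) are all assumptions made for illustration.

```python
from bisect import bisect_right
from math import sqrt
from statistics import mean


def rmse(sim, obs):
    """Root-mean-square error between two equal-length series."""
    return sqrt(mean((s - o) ** 2 for s, o in zip(sim, obs)))


def linear_scaling(model, obs, additive=True):
    """Match the model mean to the observed mean.

    additive=True shifts the series (a common choice for temperature);
    additive=False rescales it (a common choice for precipitation).
    """
    if additive:
        delta = mean(obs) - mean(model)
        return [m + delta for m in model]
    factor = mean(obs) / mean(model)  # assumes the model mean is nonzero
    return [m * factor for m in model]


def quantile(sorted_vals, p):
    """Linearly interpolated quantile of a sorted list, with p in [0, 1]."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def quantile_mapping(model, obs):
    """Empirical quantile mapping.

    Each model value is replaced by the observed value that has the same
    empirical non-exceedance probability.
    """
    model_sorted = sorted(model)
    obs_sorted = sorted(obs)
    n = len(model_sorted)
    corrected = []
    for m in model:
        rank = bisect_right(model_sorted, m) - 1  # position of m in the model sample
        p = rank / (n - 1) if n > 1 else 0.0
        corrected.append(quantile(obs_sorted, p))
    return corrected


def ensemble_mean(series_list):
    """Equal-weight average across models (a stand-in for BMA weights)."""
    return [mean(values) for values in zip(*series_list)]


# Hypothetical monthly values at one station (illustrative only).
obs = [12.0, 3.0, 25.0, 8.0, 17.0]
model_a = [20.0, 9.0, 31.0, 5.0, 22.0]
model_b = [6.0, 4.0, 15.0, 9.0, 8.0]

# Ordering 1: bias-correct each model, then average.
correct_then_average = ensemble_mean(
    [quantile_mapping(m, obs) for m in (model_a, model_b)]
)
# Ordering 2: average the raw models, then bias-correct the ensemble.
average_then_correct = quantile_mapping(ensemble_mean([model_a, model_b]), obs)

print("RMSE, raw model A:           ", rmse(model_a, obs))
print("RMSE, correct then average:  ", rmse(correct_then_average, obs))
print("RMSE, average then correct:  ", rmse(average_then_correct, obs))
```

Because calibration and evaluation share the same toy sample, the printed errors say nothing about which ordering is better in practice; the sketch only shows where the two orderings differ in the processing chain. Note also that quantile mapping of this kind assumes a bias that is stable over time, which is consistent with its reported weakness at stations with nonstationary bias.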