Abstract. Although the quality of weather forecasts in the polar regions is improving, forecast skill there still lags behind that at lower latitudes. So far, relatively few efforts have been made to evaluate processes in numerical weather prediction systems using in situ and remote sensing datasets from meteorological observatories in the terrestrial Arctic and Antarctic, compared with the mid-latitudes. Progress has been limited both by the heterogeneous nature of observatory and forecast data and by the limited availability, in multi-model forecast archives, of the parameters needed to perform process-oriented evaluation. The Year of Polar Prediction (YOPP) site Model Inter-comparison Project (YOPPsiteMIP) is addressing this gap by producing merged observatory data files (MODFs) and merged model data files (MMDFs), which bring together observations and forecast data at polar meteorological observatories in a format designed to facilitate process-oriented evaluation. Forecast performance was evaluated at seven Arctic sites, focussing on the first YOPP Special Observing Period in the Northern Hemisphere (NH-SOP1) in February and March 2018. The evaluation demonstrated that, although the characteristics of forecast skill vary between the different sites and systems, an underestimation of boundary layer temperature variability across models, which goes hand in hand with an inability to capture cold extremes, is a common issue at several sites. Many models tend to underestimate the sensitivity of the 2 m air temperature (T2m) and the surface skin temperature to variations in radiative forcing, and the reasons for this are discussed.