Abstract

Two major multiyear, Arctic-wide (60°–90°N) warm anomalies (>0.7°C) in land surface air temperature (LSAT) occurred during the twentieth century: one between 1920 and 1950 and another at the end of the century, after 1979. Reproducing this decadal and longer variability in coupled general circulation models (GCMs) is a critical test for understanding processes in the Arctic climate system and for increasing confidence in the Intergovernmental Panel on Climate Change (IPCC) model projections. This study evaluated 63 realizations of the twentieth-century climate in coupled models (20C3M) experiment, generated by 20 coupled GCMs made available for the IPCC Fourth Assessment, together with the corresponding control runs (PIcntrl). Warm anomalies in the Arctic during the last two decades of the century were reproduced by all ensemble members, with considerable variability in amplitude among models. In contrast, only eight models generated, in at least one realization, a warm anomaly with an amplitude of at least two-thirds of the observed midcentury warm event, and these did not reproduce its timing. The midcentury warm events in all the models were decadal in duration, whereas the observed event was interdecadal. The variance of the control runs in nine models was comparable with the variance in the observations. The random timing of midcentury warm anomalies in the 20C3M simulations and the similar variance of the control runs in about half of the models suggest that the observed midcentury warm period is consistent with intrinsic climate variability. Five models compared somewhat favorably with Arctic observations, both matching the variance of the observed temperature record in their control runs and representing the decadal mean temperature anomaly amplitude in their 20C3M simulations. Seven additional models could be given further consideration. The results support selecting a subset of GCMs when making predictions for future climate, using performance criteria based on comparison with retrospective data.
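
The two screening criteria named in the abstract (control-run variance comparable with observations, and a decadal mean anomaly of at least two-thirds of the observed amplitude in at least one realization) can be illustrated with a minimal Python sketch. This is not the study's actual procedure: the function names, the 10-year running-mean window, the dictionary layout, and the factor-of-two tolerance used for "comparable" variance are all assumptions made here for illustration, since the abstract does not specify them.

```python
from statistics import mean, pvariance


def running_mean(series, window=10):
    """Running means over `window` consecutive annual values."""
    return [mean(series[i:i + window]) for i in range(len(series) - window + 1)]


def peak_decadal_anomaly(series, window=10):
    """Largest decadal-mean value in a series of annual temperature anomalies."""
    return max(running_mean(series, window))


def passes_amplitude(realizations, observed_peak, fraction=2 / 3, window=10):
    """True if at least one realization reaches `fraction` of the observed peak."""
    return any(
        peak_decadal_anomaly(run, window) >= fraction * observed_peak
        for run in realizations
    )


def passes_variance(control, observed, tolerance=2.0):
    """True if the control-run variance is within a factor of `tolerance`
    of the observed variance (the tolerance is an assumed threshold)."""
    ratio = pvariance(control) / pvariance(observed)
    return 1.0 / tolerance <= ratio <= tolerance


def select_models(models, observed, observed_peak):
    """Names of models that satisfy both criteria.

    `models` maps a model name to {"control": [...], "runs": [[...], ...]};
    this layout is hypothetical.
    """
    return sorted(
        name
        for name, data in models.items()
        if passes_variance(data["control"], observed)
        and passes_amplitude(data["runs"], observed_peak)
    )
```

Requiring both tests mirrors the abstract's description of the five favorably compared models; relaxing either test would correspond to the wider group given "further consideration," although the exact rule used for that group is not stated in the abstract.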