Abstract

Background/Introduction
Accurate cardiovascular disease (CVD) risk prediction models can facilitate the identification and cost-effective treatment of high-risk individuals. Discrimination and calibration are the most commonly recommended metrics for evaluating model performance. However, both are population-level measures, and it remains unclear whether they are sufficiently sensitive to assess the performance of models designed for use with individual patients.

Purpose
Using a real-world dataset, we aimed to compare CVD risk estimates for example individuals (individual-level measures) and discrimination and calibration metrics (population-level measures) from two CVD risk prediction models: one with "age" as the only predictor and one with multiple predictors.

Methods
We constructed a cohort of almost all New Zealanders who were alive, aged 30–79 years, and free of CVD and heart failure in 2014, with follow-up through linkage to hospitalisation and mortality records until 2018 (N = 2,098,359). We derived two sets of sex-specific Cox regression models to predict 5-year CVD risk. The age-only models included "age" as the only predictor, and the full models used nine pre-defined, routinely available CVD risk predictors with interaction terms. We calculated and compared individuals’ 5-year CVD risk from the age-only and full models and assessed model performance using Harrell’s C-statistics and calibration plots.

Results
Risk estimates for individuals differed substantially between the two models. A 65-year-old woman had a predicted 5-year risk of 4.9% with the age-only model but 16.1% with the full model; a 74-year-old man had predicted risks of 15.2% and 9.2% with the age-only and full models, respectively (Figure 1). However, the models had similar discrimination and calibration (Figure 2). Age-only models had good discrimination, with C-statistics of 0.773 (95% CI: 0.771–0.775) in women and 0.740 (0.738–0.742) in men, and were well calibrated in both sexes. The full models had slightly better discrimination (C-statistics: 0.814 [0.812–0.816] in women and 0.769 [0.767–0.771] in men) and very similar calibration to the age-only models in both sexes.

Conclusions
Age-only models showed good discrimination and were well calibrated, but predicted very different risks in individual case studies compared with models with multiple predictors. This study calls into question the use of discrimination and calibration metrics as the main measures for comparing the performance of different CVD risk prediction models. We recommend that examples of predicted risk for individuals, and an assessment of the magnitude and prevalence of cardiovascular predictors, also be considered when comparing the performance of prognostic models.
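
For readers less familiar with the discrimination metric used here, the following is a minimal sketch of how Harrell's C-statistic can be computed from follow-up times, event indicators, and predicted risks. It is not the study's code. The function name `harrell_c` and all toy values are hypothetical and are not taken from the cohort, and pairs with tied follow-up times are skipped as a simplifying assumption.

```python
from itertools import combinations


def harrell_c(times, events, risks):
    """Harrell's C: among usable pairs, the share in which the person with
    the earlier observed event was assigned the higher predicted risk."""
    usable = 0
    concordant = 0.0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification (assumption): skip tied follow-up times
        # 'first' is the person with the shorter follow-up time
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # shorter time is censored, so the pair cannot be ordered
        usable += 1
        if risks[first] > risks[second]:
            concordant += 1.0
        elif risks[first] == risks[second]:
            concordant += 0.5  # tied predictions count as half-concordant
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable


# Hypothetical toy data: follow-up in years, 1 = CVD event, 0 = censored.
times = [1.0, 2.0, 3.0, 4.0, 5.0]
events = [1, 1, 0, 1, 0]

# Two hypothetical sets of predicted 5-year risks for the same five people.
risks_a = [0.30, 0.20, 0.10, 0.15, 0.05]
risks_b = [0.12, 0.25, 0.10, 0.15, 0.05]
# Same ranking as risks_a, but every absolute risk is halved.
risks_a_halved = [r / 2 for r in risks_a]

c_a = harrell_c(times, events, risks_a)
c_b = harrell_c(times, events, risks_b)
c_a_halved = harrell_c(times, events, risks_a_halved)
print(c_a, c_b, c_a_halved)
```

On this toy data the script prints `1.0 0.75 1.0`. Halving every predicted risk leaves the C-statistic unchanged because the metric depends only on how individuals are ranked, not on the absolute risk each person is given; shifts in absolute risk are the province of calibration rather than discrimination.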