Abstract
Land surface models (LSMs) are used to simulate the terrestrial component of the water, energy, and biogeochemical cycles. These simulations are useful for water resources management, drought and flood prediction, and numerical weather and climate prediction. However, the usefulness of LSMs depends on their ability to reproduce states and fluxes realistically. Accurate measurements of water storage are therefore valuable for calibrating and validating LSM outputs. Geological weighing lysimeters (GWLs) are instruments that can provide field-scale estimates of the integrated total water storage within a soil profile. We use field estimates of total water storage and subsurface storage to critically evaluate two land surface models: Modélisation Environnementale communautaire – Surface Hydrology (MESH), which uses the Canadian Land Surface Scheme (CLASS), and the Structure for Unifying Multiple Modeling Alternatives (SUMMA). The two models differ in how they represent land surface processes and properties. We attempted to parameterize both models equivalently to minimize differences between their configurations. Both models reproduced the observed total water storage and subsurface storage reasonably well. However, they were inconsistent in the simulated timing of snowmelt, the depth of soil freezing, total evapotranspiration, the partitioning of evaporation between soil evaporation and evaporation of intercepted water, and soil drainage. Neither model emerged as better overall, although each had specific strengths and weaknesses, which we describe. Insights from this study can be used to improve model physics and performance.