Computing evapotranspiration (ET) with satellite-based energy balance models such as METRIC (Mapping EvapoTranspiration at high Resolution with Internalized Calibration) requires internal calibration of sensible heat flux using anchor pixels (“hot” and “cold” pixels). Automated anchor pixel selection methods classify a pool of candidate pixels by the amount of vegetation (normalized difference vegetation index, NDVI) and surface temperature (Ts), but final pixel selection still relies heavily on operator experience. The differences in final ET estimates that result from subjectivity in selecting the final “hot” and “cold” pixel pair from within the candidate pool have not yet been investigated. This is likely because the surface properties of these candidate pixels, as quantified by NDVI and Ts, are generally assumed to have low variability that can be attributed to random noise. In this study, we test the assumption of low variability by first applying an automated calibration pixel selection process to 42 nearly cloud-free Landsat images of the San Joaquin area in California taken between 2013 and 2015. We then compute dT (vertical near-surface temperature difference) vs. Ts relationships at all pixels that could potentially be used for model calibration, in order to explore the variance in ET among multiple calibration schemes for which NDVI and Ts variability is intrinsically negligible. Our results show significant variability in ET (ranging from 5% to 20%) and a high, statistically consistent variability in dT values, indicating that additional surface properties affect the calibration process and are not captured when using only NDVI and Ts. Our findings further highlight the potential for calibration improvements by showing that the dT vs. Ts calibration relationship between the cold anchor pixel with the lowest dT and the hot anchor pixel with the highest dT consistently provides the best daily ET estimates. Quantifying ET variability as a function of candidate pixel selection, as done here, offers a way to measure the biases inadvertently introduced by user subjectivity and can inform improvements in model usability and performance.
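To make the dependence of the calibration on the chosen anchor pair concrete, the following minimal sketch fits a dT vs. Ts line through one cold and one hot pixel and repeats the fit for every pair in a small candidate pool. It assumes the linear form dT = a + b·Ts; the function name `fit_dt_line` and all numerical values are hypothetical placeholders chosen for illustration, not data or code from this study.

```python
from itertools import product


def fit_dt_line(ts_cold, dt_cold, ts_hot, dt_hot):
    """Return (a, b) such that dT = a + b * Ts passes through both anchor pixels.

    Assumes a linear dT vs. Ts relationship; inputs are in kelvin.
    """
    if ts_hot == ts_cold:
        raise ValueError("Anchor pixels must have different surface temperatures")
    b = (dt_hot - dt_cold) / (ts_hot - ts_cold)
    a = dt_cold - b * ts_cold
    return a, b


# Hypothetical candidate pools as (Ts, dT) pairs in kelvin.
# The pixels within each pool have similar Ts but different dT.
cold_candidates = [(295.0, 0.0), (295.5, 0.5)]
hot_candidates = [(320.0, 5.0), (320.5, 6.0)]

# One calibration line per possible (cold, hot) anchor pair.
fits = [
    fit_dt_line(ts_c, dt_c, ts_h, dt_h)
    for (ts_c, dt_c), (ts_h, dt_h) in product(cold_candidates, hot_candidates)
]

# Spread of the predicted dT at an intermediate surface temperature.
ts_query = 305.0
dt_spread = [a + b * ts_query for a, b in fits]
print(f"dT at Ts = {ts_query} K ranges from {min(dt_spread):.2f} to {max(dt_spread):.2f} K")
```

Even with candidate pixels that differ by only 0.5 K in Ts, the predicted dT at an intermediate pixel changes with the anchor pair, which is the kind of spread that would propagate into sensible heat flux and ET.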