ABSTRACT

Many testing programs use automated scoring to grade essays. One issue in automated essay scoring that has not been examined adequately is population invariance and its causes. The primary purpose of this study was to investigate the impact of sampling in model calibration on the population invariance of automated scores. The study analyzed scores produced by the e‐rater® scoring engine using a GRE® assessment data set. Results suggested that an equal-allocation, stratified-by-language sampling approach performed best at maximizing population invariance, whether human/e‐rater agreement or differences in their correlation patterns with external variables served as the evaluation criterion. Guidelines are provided to assist practitioners in choosing a sampling design for model calibration. Potential causes of the lack of population invariance, study limitations, and directions for future research are discussed.