Assessing the reliability of situational judgment tests (SJTs) in high-stakes settings is problematic because Cronbach's alpha is an inappropriate measure of reliability when test items are heterogeneous. We computed the corrected, weighted mean alpha from 56 alpha coefficients, which yielded α = .46, and reviewed the types of reliability that are appropriate for SJTs. In Study 1, a longitudinal study, SJT test–retest reliability was r = .82, compared with internal consistency of α = .46 and stratified α = .45 at Time 1, and α = .52 and stratified α = .51 at Time 2. We used a student sample (Time 1: n = 185; Time 2: n = 132) and items from a credentialing exam with ‘should do’ instructions. The SJT correlated significantly with cognitive ability, r = .30, and agreeableness, r = .24. In Study 2, we assessed test–retest reliability with Human Resource professionals (Time 1: n = 94; Time 2: n = 32) who had recently been credentialed and who participated in a pilot test of new SJT items with ‘most likely/least likely do’ response options. The SJT test–retest reliability was r = .66, compared with internal consistency of α = .43 and stratified α = .47 at Time 1, and α = .61 and stratified α = .67 at Time 2. We discuss the theoretical implications of the Study 1 results as well as the practical implications for the use of SJTs in credentialing examinations.
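
For readers unfamiliar with the two internal-consistency indices contrasted above, the following is a minimal sketch of how Cronbach's alpha and stratified alpha are commonly computed. It assumes the standard textbook formulas (alpha from item and total-score variances; stratified alpha as one minus the summed error variance of each stratum subscore divided by total-score variance). The function names, the toy data, and the grouping of items into strata are hypothetical illustrations, not the authors' analysis code or data.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(scores[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the total (summed) score.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def stratified_alpha(scores, strata):
    """Stratified alpha; `strata` is a list of lists of item indices.

    Each stratum contributes its subscore variance times (1 - its own alpha),
    i.e. its estimated error variance, which is then divided by total variance.
    """
    total_var = variance([sum(row) for row in scores])
    error_var = 0.0
    for idx in strata:
        sub = [[row[i] for i in idx] for row in scores]
        sub_var = variance([sum(r) for r in sub])
        error_var += sub_var * (1 - cronbach_alpha(sub))
    return 1 - error_var / total_var


# Hypothetical toy data: 5 respondents x 4 items, two assumed content strata.
toy = [
    [1, 2, 0, 1],
    [2, 2, 1, 0],
    [3, 3, 1, 1],
    [4, 3, 0, 0],
    [5, 5, 1, 1],
]
alpha = cronbach_alpha(toy)
strat = stratified_alpha(toy, [[0, 1], [2, 3]])
```

When all items are placed in a single stratum, the stratified coefficient reduces to ordinary alpha. The two differ only when items are grouped into more homogeneous subsets, which is why stratified alpha is often suggested for heterogeneous tests such as SJTs.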