Using generalizability (G-) theory and rater think-aloud protocols (TAPs), this study examined how person, task, rater, and the interactions among these facets affected the variability and reliability of HSK-6 (an international standardized assessment of Chinese proficiency) writing scores assigned by national HSK writing raters. It also examined the raters' scoring decision-making processes. Sixty-four HSK-6 writing samples written by 32 learners of Chinese as a foreign language (CFL) from 17 first-language (L1) backgrounds were scored holistically by ten experienced HSK writing raters using the authentic HSK-6 scoring rubric. Immediately after scoring each writing sample, the raters were invited to produce a written retrospective TAP of their scoring decision-making process, which yielded 64 protocols per rater and a total of 640 protocols for the qualitative data analysis. The G-theory results indicated that the current single-task, two-rater holistic scoring scheme would not yield acceptable generalizability and dependability coefficients. The TAP results revealed considerable variation among raters in their scoring decision-making processes. Important implications for HSK-6 writing assessment policy makers in China are discussed.
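
For readers less familiar with G-theory, the generalizability and dependability coefficients mentioned above can be written out as follows. This is only a sketch under an assumption: it takes the design to be fully crossed with random effects, person × task × rater (p × t × r), which the description of the data suggests but the abstract does not state. The symbols ($\sigma^2$ for variance components, $n_t$ and $n_r$ for the numbers of tasks and raters in the scoring scheme) are standard G-theory notation, not notation taken from the study.

```latex
% Generalizability coefficient (relative decisions), crossed p x t x r design
E\rho^2 = \frac{\sigma^2_p}
               {\sigma^2_p + \dfrac{\sigma^2_{pt}}{n_t}
                           + \dfrac{\sigma^2_{pr}}{n_r}
                           + \dfrac{\sigma^2_{ptr,e}}{n_t n_r}}

% Dependability coefficient (absolute decisions), same design
\Phi = \frac{\sigma^2_p}
            {\sigma^2_p + \dfrac{\sigma^2_t}{n_t}
                        + \dfrac{\sigma^2_r}{n_r}
                        + \dfrac{\sigma^2_{pt}}{n_t}
                        + \dfrac{\sigma^2_{pr}}{n_r}
                        + \dfrac{\sigma^2_{tr}}{n_t n_r}
                        + \dfrac{\sigma^2_{ptr,e}}{n_t n_r}}
```

Under this assumed design, the single-task, two-rater scoring scheme corresponds to $n_t = 1$ and $n_r = 2$. With $n_t = 1$, the person-by-task component $\sigma^2_{pt}$ enters the error term undivided, so adding raters alone cannot reduce it.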