This paper explores the reliability of using ChatGPT to evaluate EFL writing by assessing its intra- and inter-rater reliability. Eighty-two compositions were randomly sampled from the Written English Corpus of Chinese Learners. These compositions were rated by three experienced raters with regard to ‘language’, ‘content’, and ‘organization’. The writing samples were also rated by ChatGPT twice, with an interval between the two rating sessions, and the average scores were calculated. An independent-samples t-test was conducted to compare the average scores given by ChatGPT and the human raters. Pearson correlation analyses were conducted between the two sets of overall scores given by ChatGPT to calculate the intra-rater reliability, and between the average scores given by ChatGPT and the human raters for the inter-rater reliability. The results of the comparative analysis show that ChatGPT may be used for evaluating EFL essays, as its scores are similar to those provided by reliable human raters. However, the results of the correlation analyses show that the intra-rater reliability of ChatGPT is not high enough to be acceptable, r = 0.575, p < 0.01, and the strength of the inter-rater reliability is only moderate, r = 0.508, p < 0.01. In addition, there is no significant relationship between the average scores given by ChatGPT and the human raters on the ‘organization’ of the writings, r = 0.181, p > 0.05. Thus, it can be concluded that ChatGPT is not a reliable tool for rating and scoring EFL writing with the prompt used in this study. One possible reason for the unreliability of ChatGPT as a rater of EFL writing appears to be its scoring of the ‘organization’ of the essays. These findings imply that while ChatGPT has potential as an evaluative tool, its current limitations, particularly in assessing organization, must be addressed before it can be reliably used in educational settings.
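To make the reliability analysis concrete, the sketch below illustrates, in Python with scipy and numpy, how the intra-rater correlation, inter-rater correlation, and independent-samples t-test described above could be computed. The score arrays are hypothetical placeholders, not the study's data; they simply stand in for ChatGPT's two rating rounds and the averaged human scores for the same essays.

```python
# Minimal sketch of the reliability analysis; the score arrays are hypothetical.
import numpy as np
from scipy import stats

# Placeholder scores for a handful of essays (not real data from the study)
chatgpt_round1 = np.array([72.0, 65.5, 80.0, 58.0, 74.5, 69.0])
chatgpt_round2 = np.array([70.0, 68.0, 78.5, 61.0, 73.0, 66.5])
human_average  = np.array([71.0, 64.0, 82.0, 60.0, 76.0, 70.5])

# Average of ChatGPT's two rating rounds, as described in the study
chatgpt_average = (chatgpt_round1 + chatgpt_round2) / 2

# Intra-rater reliability: correlation between ChatGPT's two rating rounds
r_intra, p_intra = stats.pearsonr(chatgpt_round1, chatgpt_round2)

# Inter-rater reliability: correlation between ChatGPT's average and the human average
r_inter, p_inter = stats.pearsonr(chatgpt_average, human_average)

# Independent-samples t-test comparing ChatGPT and human average scores
t_stat, p_ttest = stats.ttest_ind(chatgpt_average, human_average)

print(f"intra-rater r = {r_intra:.3f} (p = {p_intra:.3f})")
print(f"inter-rater r = {r_inter:.3f} (p = {p_inter:.3f})")
print(f"t = {t_stat:.3f} (p = {p_ttest:.3f})")
```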