English grammatical error correction (GEC) has been a popular research topic over the past decade. The appropriateness of automatic evaluation settings, e.g., the combination of metrics and reference types, has been thoroughly studied for English GEC. Yet, such systematic investigations of Chinese GEC remain insufficient. Specifically, we noticed that two representative Chinese GEC evaluation datasets, YACLC and MuCGEC, pair fluency-edit-based references with an automatic evaluation metric that was designed for minimal-edit-based references, a setting that differs from the convention of English GEC. However, it is unclear whether such evaluation settings are appropriate. Furthermore, we explored other dimensions of Chinese GEC evaluation, such as the number of references and tokenization granularity, and found that the two datasets differ significantly in these respects. We hypothesize that these differences are crucial for Chinese GEC automatic evaluation. We therefore publish the first human-annotated rankings of Chinese GEC system outputs and conduct an analytical meta-evaluation, which shows that 1) automatic evaluation metrics should match the type of references; 2) evaluation performance grows with the number of references, consistent with findings for English GEC, and four is empirically the smallest number of references that achieves the maximum correlation with human annotators; and 3) the granularity of tokenization has only a minor impact, although tokenization remains a necessary preprocessing step for Chinese texts. We have made the proposed dataset publicly accessible at https://github.com/wang136906578/RevisitCGEC.
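To illustrate the meta-evaluation idea described above, the core computation is a system-level correlation between scores assigned by an automatic metric and human-annotated rankings. The sketch below is a minimal illustration, not the authors' released code: the system scores are hypothetical placeholders, and it assumes the standard SciPy correlation functions.

```python
# Minimal sketch of system-level meta-evaluation for GEC metrics:
# correlate an automatic metric's system scores with human rankings.
# All scores below are hypothetical placeholders for illustration only.
from scipy.stats import pearsonr, spearmanr

# Hypothetical scores for five GEC systems.
metric_scores = [42.1, 38.7, 45.3, 40.2, 36.9]  # e.g., scores from an automatic metric
human_scores = [3.8, 3.1, 4.2, 3.5, 2.9]        # e.g., averaged human preference ratings

rho, p_rho = spearmanr(metric_scores, human_scores)  # rank correlation
r, p_r = pearsonr(metric_scores, human_scores)       # linear correlation
print(f"Spearman rho = {rho:.3f} (p = {p_rho:.3f})")
print(f"Pearson  r   = {r:.3f} (p = {p_r:.3f})")
```

Higher correlation indicates that the automatic evaluation setting (metric, reference type, number of references, tokenization) agrees more closely with human judgments.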