This study aims to construct a comprehensive feature system for identifying artificial intelligence–generated content (AIGC) in online Q&A communities, and to uncover the key factors and mechanisms that influence its identification. First, drawing on systemic functional linguistics (SFL) and information quality (IQ) theory, this article extracts vocabulary, content, structure, and emotional features from the text and identifies AIGC using nine mainstream machine learning algorithms. Subsequently, three widely used resampling strategies are applied to address the class imbalance problem, and grid search is used to tune combinations of hyperparameters to improve the performance of the identification classifier. Finally, SHAP (SHapley Additive exPlanations) values are introduced to evaluate and explain global feature importance and the mechanisms by which features influence identification. A Chinese corpus from the Zhihu Q&A community is constructed to verify the validity of these methods. The experimental results show that the eXtreme Gradient Boosting (XGBoost) model optimised with hybrid sampling and grid search performs excellently in identifying AI-generated text, achieving an F1-score of 0.9935, an improvement of 0.11 percentage points over the original model. In addition, all four feature dimensions constructed in this article contribute to AI-generated text identification, and the feature interpretability analysis shows that features related to content readability have the greatest impact. The study facilitates the identification and labelling of AIGC in online Q&A communities, thereby enhancing the transparency and accountability of information shared online.
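The abstract reports performance as an F1-score on an imbalanced corpus. The sketch below is illustrative only and is not the article's code. It shows how F1 is computed from precision and recall, and why it is more informative than accuracy when one class is rare. The toy labels, the 10% share of AI-generated answers, and the function names are all assumptions made for the example.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data: 1 = AI-generated, 0 = human-written.
# The 10/90 split is an assumption chosen to illustrate class imbalance.
y_true = [1] * 10 + [0] * 90

# Baseline that labels every answer as human-written.
always_human = [0] * 100

# Detector that finds 8 of the 10 AI answers and raises 2 false alarms.
detector = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 88

print("baseline accuracy:", accuracy(y_true, always_human))
print("baseline F1:      ", f1_score(y_true, always_human))
print("detector accuracy:", accuracy(y_true, detector))
print("detector F1:      ", round(f1_score(y_true, detector), 4))
```

In this toy setting the trivial baseline reaches high accuracy while its F1 is zero, which is one reason imbalanced classification results are usually reported with F1 rather than accuracy alone.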