Abstract
This study constructs a comprehensive feature system for identifying artificial intelligence–generated content (AIGC) in online Q&A communities and uses it to uncover the key factors and mechanisms that influence AIGC identification. First, drawing on systemic functional linguistics (SFL) and information quality (IQ) theory, the article extracts vocabulary, content, structure, and emotional features from the text and identifies AIGC with nine mainstream machine learning algorithms. Next, three widely used resampling strategies are applied to address the class imbalance problem, and grid search is used to tune combinations of hyperparameters to improve the performance of the identification classifier. Finally, SHAP values are introduced to evaluate and explain global feature importance and the mechanisms by which individual features influence predictions. A Chinese corpus from the Zhihu Q&A community is constructed to verify the validity of these methods. The experimental results show that the eXtreme Gradient Boosting (XGBoost) model optimised with hybrid sampling and grid search exhibits excellent performance in identifying AI-generated text, achieving an F1-score of 0.9935, an improvement of 0.11 percentage points over the original model. In addition, all four feature dimensions constructed in this article contribute to AI-generated text identification, and the feature interpretability analysis shows that features related to content readability have the greatest impact. The study facilitates the identification and labelling of AIGC in online Q&A communities, thereby enhancing the transparency and accountability of information shared online.
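To make the resampling and evaluation steps concrete, the following is a minimal sketch in plain Python. It is not the authors' pipeline: the abstract does not say which hybrid sampling method is used, so the sketch assumes the simplest variant (random under-sampling of the majority class combined with random over-sampling of the minority class). The function names `hybrid_resample` and `f1_score` and the toy data are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def hybrid_resample(samples, labels, seed=0):
    """Balance two classes by meeting in the middle (illustrative assumption).

    The majority class is randomly under-sampled and the minority class is
    randomly over-sampled with replacement, both to half the dataset size.
    """
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    if len(by_class) != 2:
        raise ValueError("this sketch handles exactly two classes")

    # Target size per class: half of the total number of samples.
    target = sum(len(xs) for xs in by_class.values()) // 2

    out_x, out_y = [], []
    for y, xs in by_class.items():
        if len(xs) >= target:
            # Under-sample without replacement.
            picked = rng.sample(xs, target)
        else:
            # Keep every original and add random duplicates.
            picked = xs + rng.choices(xs, k=target - len(xs))
        out_x.extend(picked)
        out_y.extend([y] * target)
    return out_x, out_y


def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy imbalanced data: 10 "AI-generated" items (label 1), 90 human items (label 0).
toy_samples = list(range(100))
toy_labels = [1] * 10 + [0] * 90
balanced_x, balanced_y = hybrid_resample(toy_samples, toy_labels)
class_counts = Counter(balanced_y)
```

In practice, resampling of this kind would be applied to the training split only, so that the F1-score is measured on data with the original class distribution.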