This paper presents a novel approach that uses state-of-the-art generative Artificial Intelligence (AI) models to improve the accuracy of machine learning methods in detecting AI-generated text. It combines the generative capabilities of these models with ensemble-based learning to precisely characterize the attributes of generated text. The proposed methodology consists of four steps. The first step is preprocessing, which removes irrelevant data through noise removal, text tokenization, stop-word removal, word normalization, and handling of uncommon words. The second step is feature engineering and text representation, in which every preprocessed text is represented by a square matrix that encodes word correlations, co-occurrence, and word weights. The third step is Generative Adversarial Network (GAN)-based feature extraction: a GAN model is trained to learn features that are effective for classifying texts by creator type, and its discriminator is then used as a feature extraction model. The fourth step is weighted Random Forest (RF)-based detection, in which the features extracted by the GAN discriminator serve as input to the RF-based detection model. The approach captures the differences between human-written and AI-generated texts and achieves an average accuracy of 99.60%, a 1.5% improvement over comparative methods.
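The first two steps can be illustrated with a minimal sketch, under assumptions the abstract does not confirm. The stop-word list, the vocabulary cap, the `<unk>` slot for uncommon words, the window size, and the use of raw co-occurrence counts as matrix entries are all hypothetical choices made for illustration; the paper's actual matrix also encodes word correlations and word weights, which are not reproduced here.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the source does not specify one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in", "on"}


def preprocess(text):
    """Lowercase, strip non-letter noise, tokenize, and drop stop-words."""
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def build_vocab(tokens, max_size=5, unk="<unk>"):
    """Keep the most frequent words; uncommon words share one extra slot (assumed handling)."""
    common = [w for w, _ in Counter(tokens).most_common(max_size - 1)]
    return {w: i for i, w in enumerate(common + [unk])}


def cooccurrence_matrix(tokens, vocab, window=2, unk="<unk>"):
    """Square matrix whose entry [i][j] counts how often word j occurs
    within `window` positions of word i."""
    n = len(vocab)
    matrix = [[0] * n for _ in range(n)]
    ids = [vocab.get(t, vocab[unk]) for t in tokens]
    for pos, i in enumerate(ids):
        lo = max(0, pos - window)
        hi = min(len(ids), pos + window + 1)
        for other in range(lo, hi):
            if other != pos:
                matrix[i][ids[other]] += 1
    return matrix
```

Because the vocabulary size is fixed, every text maps to a matrix of the same shape, which is what allows a fixed-input model such as a GAN discriminator to consume it in the later steps.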