Detection of AI-generated text using ensemble classifiers and stylometric feature extraction

Espin-Riofrio, César; Barba-Salazar, Joel Alejandro; Mendoza-Morán, Verónica; Vergara-Bello, Oswaldo; Zumba Gamboa, Johanna; Ayon-Castillo, Josthin; Zevallos-Escalante, Guillermo

Detection of AI-generated text using ensemble classifiers and stylometric feature extraction (#833)

Date of Conference

July 16-18, 2025

Published In

"Engineering, Artificial Intelligence, and Sustainable Technologies in service of society"

Location of Conference

Mexico

Authors

Espin-Riofrio, César

Barba-Salazar, Joel Alejandro

Mendoza-Morán, Verónica

Vergara-Bello, Oswaldo

Zumba Gamboa, Johanna

Ayon-Castillo, Josthin

Zevallos-Escalante, Guillermo

Abstract

Automatic content generation has transformed how information is produced and consumed, but it also makes authenticity and reliability harder to guarantee, particularly in sectors such as education and media. Distinguishing machine-generated texts from those written by humans is crucial for preventing the spread of misinformation and for keeping the use of these technologies transparent. This paper proposes an approach based on traditional classification models combined with ensemble techniques and Natural Language Processing (NLP) methods, using textual features such as phraseological measures, TF-IDF over n-grams, and perplexity to capture distinctive patterns. The methodology was evaluated on datasets from the COLING 2025 workshop, comprising English, Arabic, and multilingual corpora of different sizes and complexities. The Stacking Classifier model achieved an F1-macro of 0.9273 on the large English corpus and 0.9131 on the multilingual corpus. In addition, Logistic Regression and XGBoost achieved perfect performance on the smaller, more homogeneous English and Arabic datasets, respectively. These results indicate that combining key textual features with traditional and ensemble classifiers offers an effective tool for detecting automatically generated content in multilingual and complex contexts.
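The two ideas the abstract relies on most, style-based features and the F1-macro metric, can be illustrated with a minimal standard-library sketch. The abstract does not list the exact phraseological measures used, so the four features below (and the function names `stylometric_features` and `f1_macro`) are illustrative assumptions rather than the authors' implementation. TF-IDF and perplexity are omitted because they require a fitted vectorizer and a language model.

```python
import re


def stylometric_features(text):
    """Return a few simple style measurements for one document.

    The chosen features are hypothetical examples, not the paper's list.
    """
    # Split on Latin and Arabic sentence-ending punctuation.
    sentences = [s for s in re.split(r"[.!?؟]+", text) if s.strip()]
    # Runs of Unicode letters, so non-Latin scripts are tokenized too.
    words = re.findall(r"[^\W\d_]+", text.lower())
    n_words = max(len(words), 1)
    return {
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(set(words)) / n_words,
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "punct_per_word": len(re.findall(r"[,;:]", text)) / n_words,
    }


def f1_macro(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

In a fuller pipeline, feature dictionaries like these would typically be concatenated with TF-IDF vectors and passed to the base classifiers of a stacked ensemble. Because F1-macro averages the per-class scores without weighting, the human and machine classes count equally regardless of how many examples each has.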
