Detection of AI-generated text using ensemble classifiers and stylometric feature extraction (#833)
Date of Conference
July 16-18, 2025
Published In
"Engineering, Artificial Intelligence, and Sustainable Technologies in service of society"
Location of Conference
Mexico
Authors
Espin-Riofrio, César
Barba-Salazar, Joel Alejandro
Mendoza-Morán, Verónica
Vergara-Bello, Oswaldo
Zumba Gamboa, Johanna
Ayon-Castillo, Josthin
Zevallos-Escalante, Guillermo
Abstract
Automatic content generation has transformed how information is produced and consumed, but it also makes authenticity and reliability harder to guarantee, particularly in sectors such as education and media. Distinguishing automatically generated texts from those written by humans is crucial to preventing the spread of misinformation and ensuring transparency in the use of these technologies. In this context, this paper proposes an approach based on traditional classification models combined with ensemble techniques and Natural Language Processing (NLP) methods, using textual features such as phraseological measures, TF-IDF with n-grams, and perplexity to capture distinctive patterns. The methodology was evaluated on datasets from the COLING 2025 workshop, including English, Arabic, and multilingual corpora of different sizes and complexities. The Stacking Classifier model achieved an F1-macro of 0.9273 on the large English corpus and 0.9131 on the multilingual corpus. Additionally, Logistic Regression and XGBoost achieved perfect performance on the smaller, more homogeneous English and Arabic datasets, respectively. These results indicate that combining key textual features with traditional and ensemble classifiers is an effective way to detect automatically generated content in multilingual and complex contexts.
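The abstract reports its results as F1-macro, that is, the unweighted mean of the per-class F1 scores. A minimal sketch of that metric in plain Python follows. The function name `macro_f1` and the toy labels are illustrative assumptions, not taken from the paper, and the authors' own evaluation code may differ.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = set(y_true) | set(y_pred)
    if not labels:
        raise ValueError("at least one sample is required")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels: 1 = machine-generated, 0 = human-written.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
score = macro_f1(y_true, y_pred)
```

Because every class contributes equally to the average, a classifier that ignores the minority class is penalized even when overall accuracy is high, which is one reason this metric is often preferred for corpora of uneven composition.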