Evaluating Language Dependency in Large Language Models: A Study on Programming Queries in English and Spanish (#458)
Date of Conference
July 16-18, 2025
Published In
"Engineering, Artificial Intelligence, and Sustainable Technologies in service of society"
Location of Conference
Mexico
Authors
Pirzado, Farman Ali
Ahmed, Awais
Ibarra-Vázquez, Gerardo
Terashima-Marin, Hugo
Abstract
Recent research indicates that Large Language Models (LLMs) perform well when processing English input but struggle with other languages and with inputs containing non-English syntax or symbols, such as programming queries. This study therefore evaluates whether programming queries, particularly code generation queries in Spanish, a widely spoken non-English language, pose challenges comparable to those observed for English queries. To that end, the study measures accuracy differences in the code generated by two LLMs (Codex and Copilot) from English and Spanish inputs on a set of programming problems sourced from LeetCode, spanning basic, medium, and advanced task complexities. The results show accuracy differences in the code generated by both LLMs: overall, Codex and Copilot perform better on English input than on Spanish. Codex shows a marked decline in accuracy on Spanish inputs (85%) compared to English (92%), with higher rates of syntax, runtime, and logical errors, particularly as task complexity increases. In contrast, Copilot generates error-free code more reliably on multilingual tasks, achieving consistently high accuracy on English (96%, 92%, 90%) and Spanish queries (90%, 86%, 83%) across the three complexity levels, with a smaller performance gap (6%) and lower error rates. These findings highlight Copilot's superior adaptability and reliability in handling multilingual programming tasks compared to Codex, and underscore the need to improve the multilingual capabilities and address the language-dependent limitations of LLMs.
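The evaluation the abstract describes, i.e. scoring LLM-generated solutions against reference test cases and aggregating accuracy per prompt language and task complexity, can be sketched as follows. This is an illustrative sketch only, not the authors' actual harness; all function names and the toy task are hypothetical.

```python
# Hedged sketch of the kind of evaluation the study describes:
# a solution counts as correct only if it passes every test case
# (failures correspond to logical errors, exceptions to runtime errors).
from collections import defaultdict

def passes_tests(solution_fn, test_cases):
    """Return True iff solution_fn produces the expected output for
    every (args, expected) pair without raising an exception."""
    for args, expected in test_cases:
        try:
            if solution_fn(*args) != expected:
                return False  # wrong output -> logical error
        except Exception:
            return False      # crash -> runtime error
    return True

def accuracy_by_group(results):
    """results: list of dicts with keys 'language', 'complexity', 'passed'.
    Returns {(language, complexity): accuracy as a percentage}."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for r in results:
        key = (r["language"], r["complexity"])
        totals[key] += 1
        passed[key] += int(r["passed"])
    return {k: 100.0 * passed[k] / totals[k] for k in totals}

# Toy usage on one hypothetical basic-level task (add two numbers):
tests = [((1, 2), 3), ((0, 5), 5)]
results = [
    {"language": "en", "complexity": "basic",
     "passed": passes_tests(lambda a, b: a + b, tests)},   # correct solution
    {"language": "es", "complexity": "basic",
     "passed": passes_tests(lambda a, b: a - b, tests)},   # buggy solution
]
print(accuracy_by_group(results))
# → {('en', 'basic'): 100.0, ('es', 'basic'): 0.0}
```

Aggregating per (language, complexity) pair is what yields the kind of per-tier accuracy figures quoted in the abstract (e.g. 96%, 92%, 90% for English queries).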