Background and Aims: Endocrine and metabolic disorders, including diabetes mellitus (DM), pose major global health challenges. Generative artificial intelligence (genAI) models are increasingly used for patient self-help. This study aimed to evaluate the performance of two genAI models, ChatGPT and Microsoft Copilot, in addressing endocrine-related queries in English and Arabic.

Materials and Methods: This descriptive study adhered to the METRICS checklist for genAI-based healthcare studies, comparing responses from ChatGPT-4o and Microsoft Copilot to 20 endocrine-related queries in English and Arabic (15 DM queries and five other endocrine queries). Responses were evaluated using the CLEAR tool, which assesses completeness, accuracy, and relevance/appropriateness. Three endocrinology experts independently evaluated the genAI outputs.

Results: A total of 80 responses were assessed across the two models and two languages. Inter-rater reliability was high (intraclass correlation coefficient = 0.832). ChatGPT-4o consistently outperformed Microsoft Copilot, earning 'Excellent' ratings in English and 'Very good' ratings in Arabic, whereas Microsoft Copilot achieved 'Very good' ratings in English and 'Good' to 'Very good' ratings in Arabic. ChatGPT-4o surpassed Microsoft Copilot in completeness (4.38 vs. 3.36, p<.001), accuracy (4.18 vs. 3.83, p=.014), and relevance (4.44 vs. 3.82, p<.001). Performance differed significantly between English and Arabic responses (p<.001 for completeness, p=.001 for accuracy, p=.012 for relevance, and p<.001 for the overall CLEAR score). No statistically significant differences were found by query topic.

Conclusions: ChatGPT-4o outperformed Microsoft Copilot on all CLEAR components, but notable language-based disparities were evident. Addressing these limitations is crucial to ensure equitable access to endocrine care for non-English-speaking patients.