Monolingual LLMs in the Age of Multilingual Chatbots
Abstract Among the many exciting features of commercial-grade chatbots such as ChatGPT is their ability to converse in multiple languages. The chatbots are families of large language models (LLMs) built on multilingual textual data, the majority of which is licensed through third-party providers that scraped the Western internet. In an unevenly distributed internet governed by various data sovereigns, LLMs are fed content accessible from wherever their developers are based: mainly the Anglophone West. The overrepresentation of one region of the world through LLMs is an issue largely disguised by their apparently inclusive multilingual fronts. The versatile chatbots that translate texts also produce and disseminate ideological content, constituting a digital hegemony yet to be named. This essay is a step toward an understanding of the mathematics behind a new kind of language politics. Expanding the Marxist, feminist, and critical race study canons in the critique of knowledge dissemination, this essay shows that LLMs shape culture not only at the narrative level but also at the deeper level of parameters and word embeddings.
- Research Article
8
- 10.1287/ijds.2023.0007
- Apr 1, 2023
- INFORMS Journal on Data Science
How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
- Research Article
- 10.1108/ir-02-2025-0074
- Jul 29, 2025
- Industrial Robot: the international journal of robotics research and application
Purpose: This study aims to explore the integration of large language models (LLMs) and vision-language models (VLMs) in robotics, highlighting their potential benefits and the safety challenges they introduce, including robustness issues, adversarial vulnerabilities, privacy concerns and ethical implications. Design/methodology/approach: This survey conducts a comprehensive analysis of the safety risks associated with LLM- and VLM-powered robotic systems. The authors review existing literature, analyze key challenges, evaluate current mitigation strategies and propose future research directions. Findings: The study identifies that ensuring the safety of LLM-/VLM-driven robots requires a multi-faceted approach. While current mitigation strategies address certain risks, gaps remain in real-time monitoring, adversarial robustness and ethical safeguards. Originality/value: This study offers a structured and comprehensive overview of the safety challenges in LLM-/VLM-driven robotics. It contributes to ongoing discussions by integrating technical, ethical and regulatory perspectives to guide future advancements in safe and responsible artificial intelligence-driven robotics.
- Research Article
- 10.1038/s41698-025-00916-7
- May 23, 2025
- npj Precision Oncology
Large language models (LLMs) and large visual-language models (LVLMs) have exhibited near-human levels of knowledge, image comprehension, and reasoning abilities, and their performance has undergone evaluation in some healthcare domains. However, a systematic evaluation of their capabilities in cervical cytology screening has yet to be conducted. Here, we constructed CCBench, a benchmark dataset dedicated to the evaluation of LLMs and LVLMs in cervical cytology screening, and developed a GPT-based semi-automatic evaluation pipeline to assess the performance of six LLMs (GPT-4, Bard, Claude-2.0, LLaMa-2, Qwen-Max, and ERNIE-Bot-4.0) and five LVLMs (GPT-4V, Gemini, LLaVA, Qwen-VL, and ViLT) on this dataset. CCBench comprises 773 question-answer (QA) pairs and 420 visual-question-answer (VQA) triplets, making it the first dataset in cervical cytology to include both QA and VQA data. We found that LLMs and LVLMs demonstrate promising accuracy and specialization in cervical cytology screening. GPT-4 achieved the best performance on the QA dataset, with an accuracy of 70.5% for close-ended questions and an average expert evaluation score of 6.9/10 for open-ended questions. On the VQA dataset, Gemini achieved the highest accuracy for close-ended questions at 67.8%, while GPT-4V attained the highest expert evaluation score of 6.1/10 for open-ended questions. In addition, LLMs and LVLMs showed varying abilities in answering questions across different topics and difficulty levels. However, their performance remains inferior to the expertise exhibited by cytopathology professionals, and the risk of generating misinformation could lead to potential harm. Therefore, substantial improvements are required before these models can be reliably deployed in clinical practice.
- Research Article
71
- 10.1001/jamanetworkopen.2023.46721
- Dec 7, 2023
- JAMA network open
Recent advancements in large language models (LLMs) have shown potential in a wide array of applications, including health care. While LLMs showed heterogeneous results across specialized medical board examinations, the performance of these models in neurology board examinations remains unexplored. The objective was to assess the performance of LLMs on neurology board-style examinations. This cross-sectional study was conducted between May 17 and May 31, 2023. The evaluation utilized a question bank approved by the American Board of Psychiatry and Neurology and was validated with a small question cohort by the European Board for Neurology. All questions were categorized into lower-order (recall, understanding) and higher-order (apply, analyze, synthesize) questions based on the Bloom taxonomy for learning and assessment. Performance by LLM ChatGPT versions 3.5 (LLM 1) and 4 (LLM 2) was assessed in relation to overall scores, question type, and topics, along with the confidence level and reproducibility of answers. The main outcome was the overall percentage score of each LLM. LLM 2 significantly outperformed LLM 1 by correctly answering 1662 of 1956 questions (85.0%) vs 1306 questions (66.8%) for LLM 1. Notably, LLM 2's performance was greater than the mean human score of 73.8%, effectively achieving near-passing and passing grades in the neurology board examination. LLM 2 outperformed human users in behavioral, cognitive, and psychological-related questions and demonstrated superior performance to LLM 1 in 6 categories. Both LLMs performed better on lower-order than higher-order questions, with LLM 2 excelling in both lower-order and higher-order questions. Both models consistently used confident language, even when providing incorrect answers. Reproducible answers of both LLMs were associated with a higher percentage of correct answers than inconsistent answers.
Despite the absence of neurology-specific training, LLM 2 demonstrated commendable performance, whereas LLM 1 performed slightly below the human average. While higher-order cognitive tasks were more challenging for both models, LLM 2's results were equivalent to passing grades in specialized neurology examinations. These findings suggest that LLMs could have significant applications in clinical neurology and health care with further refinements.
- Research Article
- 10.1080/13658816.2025.2577252
- Nov 1, 2025
- International Journal of Geographical Information Science
The widespread use of online geoinformation platforms, such as Google Earth Engine (GEE), has produced numerous scripts. Extracting domain knowledge from these crowdsourced scripts supports understanding of geoprocessing workflows. Small Language Models (SLMs) are effective for semantic embedding but struggle with complex code; Large Language Models (LLMs) can summarize scripts, yet lack consistent geoscience terminology to express knowledge. In this paper, we propose Geo-CLASS, a knowledge extraction framework for geospatial analysis scripts that coordinates large and small language models. Specifically, we designed domain-specific schemas and a schema-aware prompt strategy to guide LLMs to generate and associate entity descriptions, and employed SLMs to standardize the outputs by mapping these descriptions to a constructed geoscience knowledge base. Experiments on 237 GEE scripts, selected from 295,943 scripts in total, demonstrated that our framework outperformed LLM baselines, including Llama-3, GPT-3.5 and GPT-4o. In comparison, the proposed framework improved accuracy in recognizing entities and relations by up to 31.9% and 12.0%, respectively. Ablation studies and performance analysis further confirmed the effectiveness of key components and the robustness of the framework. Geo-CLASS has the potential to enable the construction of geoprocessing modeling knowledge graphs, facilitate domain-specific reasoning and advance script generation via Retrieval-Augmented Generation (RAG).
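The coordination the abstract describes (an LLM produces free-text entity descriptions; a smaller model standardizes them against a geoscience knowledge base) can be sketched minimally as nearest-term matching. This is an illustration only: the knowledge-base entries are hypothetical, not from Geo-CLASS, and a bag-of-words cosine similarity stands in for the SLM's semantic embedding.

```python
from collections import Counter
import math

# Hypothetical mini knowledge base of geoscience terms (illustrative only).
KNOWLEDGE_BASE = {
    "cloud masking": "remove cloud covered pixels from satellite imagery",
    "NDVI computation": "compute normalized difference vegetation index from red and near infrared bands",
    "image compositing": "combine multiple satellite images into a single composite",
}

def bow_vector(text):
    """Bag-of-words term frequencies (stand-in for an SLM embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def standardize(description):
    """Map a free-text LLM entity description to the closest knowledge-base term."""
    vec = bow_vector(description)
    return max(KNOWLEDGE_BASE, key=lambda term: cosine(vec, bow_vector(KNOWLEDGE_BASE[term])))

print(standardize("masks out cloud covered pixels in the satellite imagery"))  # → cloud masking
```

In the actual framework, the embedding model and the constructed knowledge base would replace the toy components above; the standardization step itself is the same shape.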
- Research Article
3
- 10.1109/embc53108.2024.10782119
- Jul 15, 2024
- Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference
Deep phenotyping is the detailed description of patient signs and symptoms using concepts from an ontology. The deep phenotyping of the numerous physician notes in electronic health records requires high-throughput methods. Over the past 30 years, steady progress has been made toward making high-throughput phenotyping feasible. In this study, we demonstrate that a large language model and a hybrid NLP model (combining word vectors with a machine learning classifier) can perform high-throughput phenotyping on physician notes with high accuracy. Large language models will likely emerge as the preferred method for the high-throughput deep phenotyping of physician notes. Clinical relevance: Large language models will likely emerge as the dominant method for the high-throughput phenotyping of signs and symptoms in physician notes.
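The hybrid NLP model mentioned above (word vectors plus a machine-learning classifier) can be illustrated with a toy nearest-centroid classifier. The note snippets and phenotype labels below are invented examples, and raw word counts stand in for trained word vectors; the study's actual model is richer.

```python
from collections import Counter
import math

# Hypothetical note snippets labeled with phenotype concepts (illustrative only).
TRAIN = [
    ("patient reports shortness of breath on exertion", "dyspnea"),
    ("severe breathlessness when climbing stairs", "dyspnea"),
    ("throbbing pain in the head since morning", "headache"),
    ("complains of persistent headaches and light sensitivity", "headache"),
]

def bow(text):
    """Word-count vector (stand-in for a trained word embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

# Build one centroid (summed counts) per phenotype label.
grouped = {}
for text, label in TRAIN:
    grouped.setdefault(label, Counter()).update(bow(text))

def phenotype(note):
    """Assign the phenotype whose centroid is closest to the note."""
    vec = bow(note)
    return max(grouped, key=lambda lbl: cosine(vec, grouped[lbl]))

print(phenotype("notes breathlessness and shortness of breath"))  # → dyspnea
```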
- Research Article
5
- 10.1016/j.joms.2024.11.007
- Mar 1, 2025
- Journal of Oral and Maxillofacial Surgery
Evaluating Artificial Intelligence Chatbots in Oral and Maxillofacial Surgery Board Exams: Performance and Potential
- Research Article
8
- 10.2196/59641
- Aug 29, 2024
- JMIR infodemiology
Manually analyzing public health-related content from social media provides valuable insights into the beliefs, attitudes, and behaviors of individuals, shedding light on trends and patterns that can inform public understanding, policy decisions, targeted interventions, and communication strategies. Unfortunately, the time and effort needed from well-trained human subject matter experts makes extensive manual social media listening unfeasible. Generative large language models (LLMs) can potentially summarize and interpret large amounts of text, but it is unclear to what extent LLMs can glean subtle health-related meanings in large sets of social media posts and reasonably report health-related themes. We aimed to assess the feasibility of using LLMs for topic model selection or inductive thematic analysis of large contents of social media posts by attempting to answer the following question: Can LLMs conduct topic model selection and inductive thematic analysis as effectively as humans did in a prior manual study, or at least reasonably, as judged by subject matter experts? We asked the same research question and used the same set of social media content for both the LLM selection of relevant topics and the LLM analysis of themes as was conducted manually in a published study about vaccine rhetoric. We used the results from that study as background for this LLM experiment by comparing the results from the prior manual human analyses with the analyses from 3 LLMs: GPT4-32K, Claude-instant-100K, and Claude-2-100K. We also assessed if multiple LLMs had equivalent ability and assessed the consistency of repeated analysis from each LLM. The LLMs generally gave high rankings to the topics chosen previously by humans as most relevant. We reject the null hypothesis (P<.001, overall comparison) and conclude that these LLMs are more likely to include the human-rated top 5 content areas in their top rankings than would occur by chance. 
Regarding theme identification, LLMs identified several themes similar to those identified by humans, with very low hallucination rates. Variability occurred between LLMs and between test runs of an individual LLM. Despite not consistently matching the human-generated themes, subject matter experts found themes generated by the LLMs were still reasonable and relevant. LLMs can effectively and efficiently process large social media-based health-related data sets. LLMs can extract themes from such data that human subject matter experts deem reasonable. However, we were unable to show that the LLMs we tested can replicate the depth of analysis from human subject matter experts by consistently extracting the same themes from the same data. There is vast potential, once better validated, for automated LLM-based real-time social listening for common and rare health conditions, informing public health understanding of the public's interests and concerns and determining the public's ideas to address them.
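The comparison of LLM topic rankings against the human-rated top 5 content areas can be operationalized as a simple top-k hit rate. The topic labels below are hypothetical placeholders, not the study's actual content areas.

```python
def top_k_hit_rate(human_top, llm_ranking, k=5):
    """Fraction of human-rated top topics that appear in an LLM's top-k ranking."""
    llm_top = set(llm_ranking[:k])
    hits = sum(1 for topic in human_top if topic in llm_top)
    return hits / len(human_top)

# Hypothetical topic labels, for illustration only.
human_top5 = ["vaccine safety", "mandates", "side effects", "efficacy", "trust in science"]
llm_ranking = ["side effects", "vaccine safety", "efficacy", "misinformation",
               "mandates", "trust in science"]

print(top_k_hit_rate(human_top5, llm_ranking))  # → 0.8 (4 of 5 human topics in the LLM's top 5)
```

A chance baseline for this statistic (how often random rankings would achieve the same overlap) is what the study's significance test compares against.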
- Conference Article
3
- 10.2118/217671-ms
- Feb 27, 2024
Finding information across multiple databases, formats, and documents remains a manual job in the drilling industry. Large Language Models (LLMs) have proven effective in data-aggregation tasks, including answering questions. However, using LLMs for domain-specific factual responses poses a nontrivial challenge. The expert labor cost for training domain-specific LLMs prohibits niche industries from developing custom question-answering bots. This paper tests several commercial LLMs for information retrieval tasks for drilling data using zero-shot in-context learning. In addition, we studied the models' calibration using a few-shot multiple-choice drilling questionnaire. To create an LLM benchmark for drilling, we collated the text data from publicly available databases: the Norwegian Petroleum Directorate (NPD), company annual reports, and a petroleum glossary. We used a zero-shot learning technique that relies on an LLM's ability to generate responses for tasks outside its training. We implemented a controlled zero-shot learning "in-context" procedure that sends a user's query augmented with text data to the LLM as inputs. This implementation encourages the LLM to take the answer from the data while leveraging its pre-trained contextual-learning capability. We evaluated several state-of-the-art generic LLMs available through an API, including G4, G3.5-TI, the J2-ultra model, and the L2 series. The paper documents the pre-trained LLMs' ability to provide correct answers and identify petroleum industry jargon from the collated dataset. Our zero-shot in-context learning implementation helps vanilla LLMs provide relevant factual responses for the drilling domain. While each LLM's performance varies, we have identified models suitable for a drilling chatbot application. In particular, G4 outperformed the other models on all tasks. This finding suggests that training expensive domain-specific LLMs is not necessary for question-answering tasks in the context of drilling data. 
We demonstrate the utility of zero-shot in-context learning using pre-trained LLMs for question-answering tasks relevant to the drilling industry. Additionally, we prepared and publicly released the collated datasets from the NPD database and companies’ annual reports to enable results reproducibility and to foster acceleration of language model adoption and development for the subsurface and drilling industries. The petroleum industry may find our solution beneficial for enhancing personnel training and career development. It also offers a method for conducting data analytics and overcoming challenges in retrieving historical well data.
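The "in-context" procedure the authors describe, sending the user's query augmented with retrieved reference text, amounts to prompt assembly of the following shape. This is a generic sketch under stated assumptions, not the paper's implementation; the question and the passage are invented placeholders.

```python
def build_incontext_prompt(question, passages, max_chars=4000):
    """Assemble a zero-shot in-context prompt: retrieved reference text plus the
    user's question, instructing the model to answer only from the given data."""
    context = ""
    for p in passages:
        if len(context) + len(p) > max_chars:  # simple budget to fit the context window
            break
        context += p.strip() + "\n\n"
    return (
        "Answer the question using only the reference text below. "
        "If the answer is not in the text, say so.\n\n"
        f"Reference text:\n{context}"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical drilling query and retrieved passage, for illustration only.
prompt = build_incontext_prompt(
    "What is the kick tolerance for well X?",
    ["Well X drilling report: kick tolerance was calculated as 0.5 ppg ..."],
)
```

The assembled string would then be sent to whichever LLM API is in use; grounding the answer in supplied text is what keeps the vanilla model factual for the domain.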
- Conference Article
105
- 10.1145/3510003.3510203
- May 21, 2022
Large pre-trained language models such as GPT-3 [10], Codex [11], and Google's language model [7] are now capable of generating code from natural language specifications of programmer intent. We view these developments with a mixture of optimism and caution. On the optimistic side, such large language models have the potential to improve productivity by providing an automated AI pair programmer for every programmer in the world. On the cautionary side, since these large language models do not understand program semantics, they offer no guarantees about the quality of the suggested code. In this paper, we present an approach to augment these large language models with post-processing steps based on program analysis and synthesis techniques that understand the syntax and semantics of programs. Further, we show that such techniques can make use of user feedback and improve with usage. We present our experiences from building and evaluating such a tool, Jigsaw, targeted at synthesizing code for the Python Pandas API using multi-modal inputs. Our experience suggests that as these large language models evolve for synthesizing code from intent, Jigsaw has an important role to play in improving the accuracy of the systems.
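One concrete instance of the post-processing idea is filtering model-generated candidates through a syntax check before any deeper semantic repair. This minimal sketch uses Python's ast module and is far simpler than Jigsaw's actual analysis-and-synthesis pipeline; the candidate snippets are invented.

```python
import ast

def filter_syntactically_valid(candidates):
    """Keep only model-generated snippets that parse as valid Python.
    A minimal stand-in for the much richer program-analysis checks the
    paper describes; semantics-aware repair would come after this step."""
    valid = []
    for code in candidates:
        try:
            ast.parse(code)
            valid.append(code)
        except SyntaxError:
            continue
    return valid

candidates = [
    "df = df.dropna()",            # parses: kept
    "df = df.groupby('a').mean(",  # unbalanced paren: rejected
]
print(filter_syntactically_valid(candidates))  # → ['df = df.dropna()']
```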
- Research Article
8
- 10.1016/j.procs.2023.09.086
- Jan 1, 2023
- Procedia Computer Science
A Large and Diverse Arabic Corpus for Language Modeling
- Research Article
59
- 10.1038/s41746-024-01024-9
- Feb 19, 2024
- NPJ Digital Medicine
Large language models (LLMs) have been shown to have significant potential in few-shot learning across various fields, even with minimal training data. However, their ability to generalize to unseen tasks in more complex fields, such as biology and medicine, has yet to be fully evaluated. LLMs can offer a promising alternative approach for biological inference, particularly in cases where structured data and sample size are limited, by extracting prior knowledge from text corpora. Here we report our proposed few-shot learning approach, which uses LLMs to predict the synergy of drug pairs in rare tissues that lack structured data and features. Our experiments, which involved seven rare tissues from different cancer types, demonstrate that the LLM-based prediction model achieves significant accuracy with very few or zero samples. Our proposed model, CancerGPT (with ~124M parameters), is comparable to the larger fine-tuned GPT-3 model (with ~175B parameters). Our research contributes to tackling drug pair synergy prediction in rare tissues with limited data and to advancing the use of LLMs for biological and medical inference tasks.
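The few-shot setup, supplying a handful of labeled drug-pair examples in the prompt before the query, can be sketched as plain prompt construction. Drug names, tissue, and labels below are placeholders for illustration, not the paper's data or prompt format.

```python
def build_fewshot_prompt(examples, query_pair, tissue):
    """Assemble a k-shot prompt for drug-pair synergy prediction from text alone.
    Each example is (drug_a, drug_b, tissue, label); the query line is left
    unanswered for the model to complete."""
    lines = ["Predict whether the drug pair is synergistic in the given tissue."]
    for drug_a, drug_b, ex_tissue, label in examples:
        lines.append(f"Drugs: {drug_a} + {drug_b}; Tissue: {ex_tissue}; Synergistic: {label}")
    lines.append(f"Drugs: {query_pair[0]} + {query_pair[1]}; Tissue: {tissue}; Synergistic:")
    return "\n".join(lines)

# Hypothetical drugs and labels, for illustration only.
prompt = build_fewshot_prompt(
    [("drugA", "drugB", "pancreas", "yes"), ("drugC", "drugD", "pancreas", "no")],
    ("drugE", "drugF"), "pancreas",
)
```

With zero examples the same function degenerates to the zero-shot case, which is how prediction proceeds in tissues with no labeled samples at all.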
- Research Article
1
- 10.1609/aies.v7i1.31741
- Oct 16, 2024
- Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society
Large language models (LLMs) increasingly find their way into the most diverse areas of our everyday lives. They indirectly influence people's decisions or opinions through their daily use. Therefore, understanding how and which moral judgements these LLMs make is crucial. However, morality is not universal and depends on the cultural background. This raises the question of whether these cultural preferences are also reflected in LLMs when prompted in different languages or whether moral decision-making is consistent across different languages. So far, most research has focused on investigating the inherent values of LLMs in English. While a few works conduct multilingual analyses of moral bias in LLMs in a multilingual setting, these analyses do not go beyond atomic actions. To the best of our knowledge, a multilingual analysis of moral bias in dilemmas has not yet been conducted. To address this, our paper builds on the moral machine experiment (MME) to investigate the moral preferences of five LLMs, Falcon, Gemini, Llama, GPT, and MPT, in a multilingual setting and compares them with the preferences collected from humans belonging to different cultures. To accomplish this, we generate 6500 scenarios of the MME and prompt the models in ten languages on which action to take. Our analysis reveals that all LLMs exhibit different moral biases to some degree and that their preferences not only differ from human preferences but also vary across languages within the same model. Moreover, we find that almost all models, particularly Llama 3, diverge greatly from human values and, for instance, prefer saving fewer people over saving more.
- Research Article
- 10.3389/frai.2025.1653992
- Jan 1, 2025
- Frontiers in Artificial Intelligence
The use of Large Language Models (LLMs) such as ChatGPT is a prominent topic in higher education, prompting debate over their educational impact. Studies on the effect of LLMs on learning in higher education often rely on self-reported data, leaving an opening for complementary methodologies. This study contributes by analysing actual course grades as well as ratings by fellow students to investigate how LLMs can affect academic outcomes. We investigated whether using LLMs affected students' learning by allowing them to choose one of three options for a written assignment: (1) composing the text without LLM assistance; (2) writing a first draft and using an LLM for revisions; or (3) generating a first draft with an LLM and then revising it themselves. Students' learning was measured by their scores on a mid-course exam and final course grades. Additionally, we assessed how the students rate the quality of fellow students' texts for each of the three conditions. Finally, we examined how accurately fellow students could identify which LLM option (1–3) was used for a given text. Our results indicate only a weak effect of LLM use. However, writing a first draft and using an LLM for revisions compared favourably to the 'no LLM' baseline in terms of final grades. Ratings for fellow students' texts were higher for texts created using option 3, specifically regarding how well-written they were judged to be. Regarding text classification, students most accurately predicted the 'no LLM' baseline, but were unable to identify texts that were generated by an LLM and then edited by a student at a rate better than chance.
- Research Article
- 10.1093/geroni/igaf122.1180
- Dec 1, 2025
- Innovation in Aging
The application of LLMs to clinical practice is growing fast. However, LLMs remain understudied in geriatric practice, where they are urgently needed. This symposium will address whether the application of LLMs to geriatric practice can be trusted, via five approaches. 1) LLMs have generated gender- and race-biased outputs. We will demonstrate whether LLMs generate age-biased output by assessing their geriatric attitude as evaluated by social workers. 2) LLMs have passed the USMLE and other examinations. We will demonstrate whether LLMs can pass geriatrics knowledge competence tests evaluated by geriatricians. 3) LLMs have performed well on clinical vignettes from different clinical disciplines. We will demonstrate whether LLMs can perform well on geriatrics 5M-based vignettes of older adults evaluated by clinical providers and trainees. 4) LLMs have reviewed and summarized clinical charts. We will demonstrate whether LLMs can review geriatrics and general medicine notes to extract Mobility (one of the Geriatrics 5Ms) documentation evaluated by geriatricians. 5) LLMs can generate deprescribing recommendations, tapering schedules, and patient education materials. We will demonstrate their accuracy, safety, and appropriateness compared to recommendations from a multidisciplinary team of pharmacists, geriatricians, and nurses. Specifically, this symposium will address the following topics: 1) Geriatric Attitude of ChatGPT-4o and Its Evaluation by Social Workers. 2) ChatGPT-4o Geriatrics Knowledge Competency and Its Evaluation by Geriatricians. 3) LLMs Applied to the Geriatrics 5Ms, Evaluated by Clinical Providers and Trainees. 4) Using LLMs to Extract and Assess Mobility Documentation for the Age-Friendly Health System, Evaluated by Geriatricians. 5) Using LLMs to Generate Medication Deprescribing Recommendations Compared to Clinician-Led Deprescribing Recommendations.