Rethinking Data Use in Large Language Models
Abstract Large language models (LMs) such as ChatGPT have revolutionized natural language processing and artificial intelligence more broadly. In this work, we discuss our research on understanding and advancing these models, centered around how they use the very large text corpora they are trained on. First, we describe our efforts to understand how these models learn to perform new tasks after training, demonstrating that their so-called in-context learning capabilities are almost entirely determined by what they learn from the training data. Next, we introduce a new class of LMs—nonparametric LMs—that repurpose this training data as a data store from which they retrieve information for improved accuracy and updatability. We discuss our work establishing the foundations of such models, including one of the first broadly used neural retrieval models and an approach that simplifies a traditional, two-stage pipeline into one. We also discuss how nonparametric models open up new avenues for responsible data use, e.g., by segregating permissive and copyrighted text and using them differently. Finally, we envision the next generation of LMs we should build, focusing on efficient scaling, improved factuality, and decentralization.
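The datastore-retrieval idea behind nonparametric LMs can be illustrated with a minimal nearest-neighbor lookup. This is a toy sketch only: the documents and hand-made embedding vectors below stand in for a trained encoder and a real datastore, neither of which comes from the work described in the abstract.

```python
import numpy as np

# Toy datastore: each document is paired with a made-up embedding vector.
# In a real nonparametric LM these vectors would come from a trained encoder.
documents = [
    "the cat sat on the mat",
    "neural retrieval models",
    "copyright and permissive text",
]
embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.3],
    [0.0, 0.2, 0.9],
])

def retrieve(query_vec, k=1):
    """Return indices of the k most similar datastore entries by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = d @ q  # cosine similarity against every datastore entry
    return np.argsort(-scores)[:k]

# A query vector close to the second embedding retrieves the second document.
idx = retrieve(np.array([0.05, 0.9, 0.2]))[0]
print(documents[idx])
```

A retrieval-augmented model would then condition its generation on the retrieved text, which is what makes the datastore updatable without retraining the model's parameters.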
- Research Article
8
- 10.1287/ijds.2023.0007
- Apr 1, 2023
- INFORMS Journal on Data Science
How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
- Research Article
4
- 10.1609/aaai.v37i13.26879
- Jun 26, 2023
- Proceedings of the AAAI Conference on Artificial Intelligence
Large neural network-based language models play an increasingly important role in contemporary AI. Although these models demonstrate sophisticated text generation capabilities, they have also been shown to reproduce harmful social biases contained in their training data. This paper presents a project that guides students through an exploration of social biases in large language models. As a final project for an intermediate college course in Artificial Intelligence, students developed a bias probe task for a previously-unstudied aspect of sociolinguistic or sociocultural bias they were interested in exploring. Through the process of constructing a dataset and evaluation metric to measure bias, students mastered key technical concepts, including how to run contemporary neural networks for natural language processing tasks; construct datasets and evaluation metrics; and analyze experimental results. Students reported their findings in an in-class presentation and a final report, recounting patterns of predictions that surprised, unsettled, and sparked interest in advocating for technology that reflects a more diverse set of backgrounds and experiences. Through this project, students engage with and even contribute to a growing body of scholarly work on social biases in large language models.
- Research Article
16
- 10.1162/daed_e_01897
- May 1, 2022
- Daedalus
Getting AI Right: Introductory Notes on AI & Society
- Research Article
1
- 10.1038/s41598-025-98483-1
- Apr 21, 2025
- Scientific Reports
Large language models (LLMs) are artificial intelligence (AI)-based computational models designed to understand and generate human-like text. With billions of training parameters, LLMs excel at identifying intricate language patterns, enabling remarkable performance across a variety of natural language processing (NLP) tasks. Since the introduction of transformer architectures, they have been reshaping industry with their text generation capabilities. LLMs play an innovative role across various industries by automating NLP tasks. In healthcare, they assist in diagnosing diseases, personalizing treatment plans, and managing patient data. In the automotive industry, LLMs support predictive maintenance. They also power recommendation systems and consumer behavior analysis. In education, LLMs support researchers and offer personalized learning experiences. In finance and banking, LLMs are used for fraud detection, customer service automation, and risk management. LLMs are driving significant advancements across industries by automating tasks, improving accuracy, and providing deeper insights. Despite these advancements, LLMs face challenges such as ethical concerns, biases in training data, and significant computational resource requirements, which must be addressed to ensure impartial and sustainable deployment. This study provides a comprehensive analysis of LLMs, their evolution, and their diverse applications across industries, offering researchers valuable insights into their transformative potential and the accompanying limitations.
- Discussion
- 10.14245/ns.2448236.118
- Mar 1, 2024
- Neurospine
The introduction of artificial intelligence (AI), particularly large language models (LLMs) such as the generative pre-trained transformer (GPT) series, into the medical field has heralded a new era of data-driven medicine. AI's capacity for processing vast datasets has enabled the development of predictive models that can forecast patient outcomes with remarkable accuracy. LLMs like GPT and its successors have demonstrated an ability to understand and generate human-like text, facilitating their application in medical documentation, patient interaction, and even in generating diagnostic reports from patient data and imaging findings. Over the past 10 years, the development of AI, LLMs, and GPTs has significantly impacted the field of neurosurgery and spinal care as well. [1] [2] [3] [4] [5] Zaidat et al. 6 studied the performance of an LLM in the generation of clinical guidelines for antibiotic prophylaxis in spine surgery. This study delves into the capabilities of ChatGPT's models, GPT-3.5 and GPT-4.0, showcasing their potential to streamline medical processes. They suggest that GPT-3.5's ability to generate clinically relevant antibiotic use guidelines for spinal surgery is commendable; however, its limitations, such as the inability to discern the most crucial aspects of the guidelines, redundancy, fabrication of citations, and inconsistency, pose significant barriers to its practical application. GPT-4.0, on the other hand, demonstrates a marked improvement in response accuracy and the ability to cite authoritative guidelines, such as those from the North American Spine Society (NASS). This model's enhanced performance, including a 20% increase in response accuracy and the ability to cite the NASS guideline in over 60% of responses, suggests a more reliable tool for clinicians seeking to integrate AI-generated content into their practice.
However, the study's findings also highlight the inherent unpredictability of LLM responses and the potential for "artificial hallucination," where models generate spurious statements without a solid basis in their training data. This phenomenon raises concerns about the ethical implications of using LLMs in clinical settings, particularly regarding patient care and liability. The possibility of LLMs providing inaccurate responses, especially when prompted for medical advice, necessitates a cautious approach to their deployment. We also pay attention to the limitations of the study itself, including the outdated nature of the NASS guidelines, which have not been updated since 2013, and the potential biases and gaps in the medical knowledge contained within the LLMs' training data.
- Research Article
3
- 10.1016/j.joms.2024.11.007
- Mar 1, 2025
- Journal of Oral and Maxillofacial Surgery
Evaluating Artificial Intelligence Chatbots in Oral and Maxillofacial Surgery Board Exams: Performance and Potential
- Research Article
- 10.1158/1557-3265.aimachine-b007
- Jul 10, 2025
- Clinical Cancer Research
Large language models (LLMs) are increasingly pivotal in cancer research, yet current public datasets offer insufficient scale and diversity to capture the complexity of oncology. To address this gap, we created TheBlueScrubs-v1, a 25-billion-token corpus of medical texts curated from the SlimPajama dataset. Approximately one-third of these tokens (∼11 billion) are annotated as cancer-related, making this one of the largest public, domain-focused text collections available for training and benchmarking oncology LLMs. Our two-stage pipeline first applied a high-speed logistic regression classifier (trained on a balanced set of 60,000 medical vs. non-medical documents) to label texts by medical relevance. This process extracted ∼4% of SlimPajama, yielding documents with at least 0.8 probability of containing medical content. Next, a 70B-parameter open-source LLM (Llama 3.1) evaluated each text’s medical scope, factual precision, and safety on 1–5 scales. Validation by clinicians and GPT-4o found strong concordance, confirming the reliability of these automated assessments. We further developed a specialized cancer classifier using logistic regression with TF-IDF features, trained on 60,000 examples, to identify oncology-related texts. This yielded a high-quality oncology subset (∼11 billion tokens) spanning topics such as cancer diagnosis, therapeutics, and real-world clinical notes. Detailed safety metrics enable red-teaming to mitigate misinformation and promote ethical use in oncology research. Potential applications include (1) fine-tuning LLMs for oncology-focused tasks such as treatment recommendation, clinical trial matching, and patient education, (2) building safety classifiers to detect harmful or misleading content, and (3) synthetic data generation to expand training sets while preserving privacy. 
Early experiments demonstrate that LLMs fine-tuned on TheBlueScrubs-v1 achieve performance on par with or exceeding models trained on smaller, specialized medical corpora. By releasing this large-scale, annotated dataset under an open license, we aim to accelerate innovation in AI-driven cancer research and foster collaborative efforts toward safer, more accurate clinical language models. Citation Format: Luis Felipe, Gilmer Valdes. TheBlueScrubs-v1: A Large-Scale Curated Dataset with ∼11 Billion Oncology Tokens for AI-Driven Cancer Research [abstract]. In: Proceedings of the AACR Special Conference in Cancer Research: Artificial Intelligence and Machine Learning; 2025 Jul 10-12; Montreal, QC, Canada. Philadelphia (PA): AACR; Clin Cancer Res 2025;31(13_Suppl):Abstract nr B007.
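The filtering stage the abstract describes (a logistic regression classifier over TF-IDF features that labels texts by medical relevance) can be sketched with scikit-learn. This is an illustrative miniature, not the authors' code: the four training documents below stand in for the 60,000 labeled examples, and the classifier names are standard scikit-learn components assumed to match the abstract's description.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny stand-in for the 60,000 medical vs. non-medical training documents.
texts = [
    "tumor biopsy revealed malignant cells",        # medical
    "chemotherapy dosing for stage II carcinoma",   # medical
    "the stock market rallied on earnings news",    # non-medical
    "the recipe calls for two cups of flour",       # non-medical
]
labels = [1, 1, 0, 0]  # 1 = medical, 0 = non-medical

# TF-IDF features feeding a logistic-regression classifier, as in the pipeline's first stage.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

# The paper keeps documents whose predicted medical probability is at least 0.8.
prob = clf.predict_proba(["malignant tumor chemotherapy"])[0][1]
keep = prob >= 0.8
```

At corpus scale this kind of lightweight classifier is the point: it is fast enough to sweep billions of tokens before the much more expensive LLM-based scoring stage runs on the surviving documents.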
- Research Article
- 10.52723/jkl.48.007
- Nov 30, 2023
- The Society Of Korean Literature
The release of ChatGPT to the public at the end of last year had many in the field of education worried. In response, this paper explored the future of college education and artificial intelligence (AI). First, a proper understanding of how large language models (LLMs) “train” and “learn,” along with their abilities and limitations, was established. Simply put, while LLMs produce plausible linguistic output, they are “stochastic parrots” that have no actual understanding of language.
 Next, we examined the dangers of generative AI and discovered that they might help in the creation and dissemination of misinformation. Even if these AI are not used with malicious intent, the fact that their training data sets are drawn from the internet—which reflects majority thinking—means that they can perpetuate and amplify social inequality and hegemonic stereotypes and biases. On the other hand, if we consider what is missing from the training data, it is only natural that marginalized voices should be even more marginalized. In addition, leaving the issue of the socially vulnerable aside, LLMs can only be trained on digital data, meaning analog data is ignored. This is in line with the idea of “the destruction of history” put forth by Joseph Weizenbaum, an early critic who warned of the dangers of artificial intelligence.
 We then discussed the relationship between humans and machines and considered which relationships were problematic and which were desirable. Researchers in the aviation industry recognized the problem of automation bias from an early date, but this phenomenon can be seen in other areas of society as well. Put simply, if a human places too much trust in a machine, they abdicate their decision-making responsibility to that machine and thus fail to respond quickly to solve any problems that may arise should that machine malfunction. LLMs do not endanger lives in the same way that airplanes do, but a similar bias can be seen with them as well. A more important issue, though, is the fact that people are no longer seen as whole human beings but as computers. This tendency was evident long before the advent of computers, for example in the attempts to quantify human intelligence through IQ tests, but it is a problem we must be particularly wary of in the age of AI.
 Lastly, we considered means for college education to find its way in the present situation. Educators in the US in particular, while dealing with ChatGPT, have pinpointed not the LLMs themselves but the "transactional nature" of education as the problem. That is, they argue that education has long since become less a process of learning and more a transaction in which students receive grades and degrees. Given this transactional environment, it is no wonder that students would rely too much on ChatGPT. This over-reliance, however, comes with side effects: not learning how to think properly, a lack of sufficient academic information, and learning an AI-based writing style. In response, US educators have proposed both "stick" (strategies that make it difficult for students to use LLMs) and "carrot" (strategies that encourage students to learn like human beings, not algorithms) solutions, but the heart of the matter seems to be a sense of responsibility. Creating an educational environment in which students can develop a sense of responsibility for themselves is the path forward for education in the age of AI. If we do this, LLMs can become a useful tool rather than an enemy to fear.
- Research Article
28
- 10.5204/mcj.3004
- Oct 2, 2023
- M/C Journal
- Research Article
118
- 10.1097/corr.0000000000002704
- May 23, 2023
- Clinical orthopaedics and related research
Neural networks, deep learning, and artificial intelligence (AI) have advanced rapidly in recent years. Earlier deep learning systems were structured around domain-specific tasks, trained on datasets tailored to narrow areas of interest, yielding high accuracy and precision. A new AI model built on large language models (LLMs) and covering nonspecific domains, ChatGPT (OpenAI), has gained attention. Although AI has demonstrated proficiency in managing vast amounts of data, implementation of that knowledge remains a challenge. (1) What percentage of Orthopaedic In-Training Examination questions can a generative, pretrained transformer chatbot (ChatGPT) answer correctly? (2) How does that percentage compare with results achieved by orthopaedic residents of different levels, and if scoring lower than the 10th percentile relative to 5th-year residents is likely to correspond to a failing American Board of Orthopaedic Surgery score, is this LLM likely to pass the orthopaedic surgery written boards? (3) Does increasing question taxonomy affect the LLM's ability to select the correct answer choices? This study randomly selected 400 of 3840 publicly available questions based on the Orthopaedic In-Training Examination and compared the mean score with that of residents who took the test over a 5-year period. Questions with figures, diagrams, or charts were excluded, including five questions the LLM could not provide an answer for, resulting in 207 questions administered with raw score recorded. The LLM's answer results were compared with the Orthopaedic In-Training Examination ranking of orthopaedic surgery residents. Based on the findings of an earlier study, a pass-fail cutoff was set at the 10th percentile.
Questions answered were then categorized based on the Buckwalter taxonomy of recall, which deals with increasingly complex levels of interpretation and application of knowledge; comparison was made of the LLM's performance across taxonomic levels and was analyzed using a chi-square test. ChatGPT selected the correct answer 47% (97 of 207) of the time, and 53% (110 of 207) of the time it answered incorrectly. Based on prior Orthopaedic In-Training Examination testing, the LLM scored in the 40th percentile for postgraduate year (PGY) 1s, the eighth percentile for PGY2s, and the first percentile for PGY3s, PGY4s, and PGY5s; based on the latter finding (and using a predefined cutoff of the 10th percentile of PGY5s as the threshold for a passing score), it seems unlikely that the LLM would pass the written board examination. The LLM's performance decreased as question taxonomy level increased (it answered 54% [54 of 101] of Tax 1 questions correctly, 51% [18 of 35] of Tax 2 questions correctly, and 34% [24 of 71] of Tax 3 questions correctly; p = 0.034). Although this general-domain LLM has a low likelihood of passing the orthopaedic surgery board examination, testing performance and knowledge are comparable to those of a first-year orthopaedic surgery resident. The LLM's ability to provide accurate answers declines with increasing question taxonomy and complexity, indicating a deficiency in implementing knowledge. Current AI appears to perform better at knowledge and interpretation-based inquiries, and based on this study and other areas of opportunity, it may become an additional tool for orthopaedic learning and education.
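The taxonomy comparison reported in the abstract can be reproduced in outline with a chi-square test of independence on the correct/incorrect counts per taxonomy level. The counts below are taken directly from the abstract; the exact p-value depends on the counts used, so treat this as a sketch of the analysis rather than a replication.

```python
from scipy.stats import chi2_contingency

# Correct vs. incorrect answers at each Buckwalter taxonomy level, from the abstract:
# Tax 1: 54/101 correct, Tax 2: 18/35 correct, Tax 3: 24/71 correct.
table = [
    [54, 101 - 54],  # Tax 1
    [18, 35 - 18],   # Tax 2
    [24, 71 - 24],   # Tax 3
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")  # p falls below 0.05
```

A significant result here means the proportion of correct answers is not the same across the three taxonomy levels, which is the statistical basis for the claim that performance declines as question complexity increases.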
- Research Article
3
- 10.1007/978-3-031-64892-2_11
- Jan 1, 2024
- Advances in experimental medicine and biology
A large language model (LLM), in the context of natural language processing and artificial intelligence, refers to a sophisticated neural network that has been trained on a massive amount of text data to understand and generate human-like language. These models are typically built on architectures like transformers. The term "large" indicates that the neural network has a significant number of parameters, making it more powerful and capable of capturing complex patterns in language. One notable example of a large language model is ChatGPT. ChatGPT is a large language model developed by OpenAI that uses deep learning techniques to generate human-like text. It can be trained on a variety of tasks, such as language translation, question answering, and text completion. One of the key features of ChatGPT is its ability to understand and respond to natural language inputs. This makes it a powerful tool for generating a wide range of text, including medical reports, surgical notes, and even poetry. Additionally, the model has been trained on a large corpus of text, which allows it to generate text that is both grammatically correct and semantically meaningful. In terms of applications in neurosurgery, ChatGPT can be used to generate detailed and accurate surgical reports, which can be very useful for sharing information about a patient's case with other members of the medical team. Additionally, the model can be used to generate detailed surgical notes, which can be very useful for training and educating residents and medical students. Overall, LLMs have the potential to be a valuable tool in the field of neurosurgery. Indeed, this abstract has been generated by ChatGPT within a few seconds. Potential applications and pitfalls of LLMs are discussed in this paper.
- Research Article
17
- 10.1016/j.cpa.2024.102722
- Feb 22, 2024
- Critical Perspectives on Accounting
New large language models (LLMs) like ChatGPT have the potential to change qualitative research by contributing to every stage of the research process from generating interview questions to structuring research publications. However, it is far from clear whether such ‘assistance’ will enable or deskill and eventually displace the qualitative researcher. This paper sets out to explore the implications for qualitative research of the recently emerged capabilities of LLMs; how they have acquired their seemingly ‘human-like’ capabilities to ‘converse’ with us humans, and in what ways these capabilities are deceptive or misleading. Building on a comparison of the different ‘trainings’ of humans and LLMs, the paper first traces the seemingly human-like qualities of the LLM to the human proclivity to project communicative intent into or onto LLMs’ purely imitative capacity to predict the structure of human communication. It then goes on to detail the ways in which such human-like communication is deceptive and misleading in relation to the absolute ‘certainty’ with which LLMs ‘converse’, their intrinsic tendencies to ‘hallucination’ and ‘sycophancy’, the narrow conception of ‘artificial intelligence’, LLMs’ complete lack of ethical sensibility or capacity for responsibility, and finally the feared danger of an ‘emergence’ of ‘human-competitive’ or ‘superhuman’ LLM capabilities. The paper concludes by noting the potential dangers of the widespread use of LLMs as ‘mediators’ of human self-understanding and culture. A postscript offers a brief reflection on what only humans can do as qualitative researchers.
- Abstract
- 10.1182/blood-2024-208513
- Nov 5, 2024
- Blood
Evaluating the Accuracy of Artificial Intelligence(AI)-Generated Synopses for Plasma Cell Disorder Treatment Regimens
- Conference Article
- 10.1109/icedeg58167.2023.10122084
- Apr 3, 2023
Recent innovations such as ChatGPT have increased public interest in artificial intelligence (AI). The keynote explained why AI is not just a short-term hype but has a long history spanning several eras. A recent revolution has been in the field of Natural Language Processing (NLP). This interdisciplinary field of research is also known as computational linguistics. It is usually realized through specific NLP tasks, ranging from simple processing steps such as tokenization, stemming, and lemmatization to Part of Speech (PoS) tagging and topic modeling. A second, more complex set of NLP tasks includes Named Entity Recognition (NER), information retrieval, relationship extraction, sentiment analysis, text similarity, and coreference resolution. Finally, the most challenging NLP tasks are considered to be Question Answering (QA), text summarization, text simplification, text generation, text translation, and chatbots. NLP has especially great potential in the public sector. For example, a new multilingual legal language model for more than 20 languages, developed for the Swiss Federal Court, offers opportunities to increase the accessibility of legal documents for citizens while preserving the digital sovereignty of government institutions. These technical results of the National Research Program (NRP) 77 project "Open Justice versus Privacy" are published on Hugging Face, a platform for sharing openly available machine learning models and datasets. Today, it is mostly private companies that build such Large Language Models (LLMs), because doing so requires a large amount of computational resources and highly skilled engineers. For example, to train the new LLaMA model, Meta AI (Facebook) needed more than $30 million worth of graphical processing units (GPUs). In addition, 450 MWh of electricity, worth about $90,000, was needed to process the data on these GPUs. To the detriment of both innovation and the environment, Meta AI released the LLaMA model only under a non-commercial license.
This means that startups and other companies cannot use the model for their own services. This calls for a discussion about how “open” today's machine learning models should be and what “open” actually means in the age of AI. The keynote presentation therefore included a proposal of 5 elements of such machine learning models that need to be openly available and licensed under an official open license in order to speak of an Open AI Model. This term is used by the United Nations definition of Digital Public Goods. These five elements include 1) model architecture (detailed scientific publications), 2) hyperparameters (built configuration), 3) training data (labeled and unlabeled datasets), 4) model weights and intermediate checkpoints (parameters), and 5) source code to build the model (programming scripts etc.). A truly openly available AI model is BLOOM, an LLM from the BigScience initiative. It was built by more than 1000 researchers from over 70 countries, trained on an infrastructure that would have cost EUR 3 million. BLOOM was released on July 12th, 2022 on Hugging Face and is licensed under the Responsible AI License (RAIL), a new type of AI license that incorporates ethical aspects while preserving the openness of the machine learning elements described.
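The simple processing steps the keynote lists (tokenization and stemming) can be sketched in a few lines of plain Python. The naive suffix-stripping stemmer below is a toy stand-in for a real algorithm such as Porter stemming, and the suffix list is an illustrative assumption, not part of any library.

```python
import re

# Illustrative suffix list; a real stemmer (e.g., Porter) uses ordered rule sets.
SUFFIXES = ("ization", "ations", "ation", "ing", "ed", "es", "s")

def tokenize(text):
    """Minimal regex tokenizer: lowercase alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Naive suffix stripping: remove the first matching suffix,
    keeping at least three characters of the original token."""
    for suf in SUFFIXES:
        if token.endswith(suf) and len(token) - len(suf) >= 3:
            return token[: len(token) - len(suf)]
    return token

tokens = tokenize("Tokenization and stemming are basic NLP processing steps.")
stems = [stem(t) for t in tokens]
print(stems)
```

The more complex tasks on the keynote's list (NER, QA, summarization) are built from this same pipeline shape, but replace hand-written rules with trained models.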
- Research Article
2
- 10.3205/zma001702
- Jan 1, 2024
- GMS journal for medical education
The high performance of generative artificial intelligence (AI) and large language models (LLMs) in examination contexts has triggered an intense debate about their applications, effects and risks. What legal aspects need to be considered when using LLMs in teaching and assessment? What possibilities do language models offer? The use of LLMs is assessed against the following statutes and laws: university statutes, state higher education laws, and licensing regulations for doctors; the Copyright Act (UrhG); the General Data Protection Regulation (GDPR); and the AI Regulation (EU AI Act). LLMs and AI offer opportunities but require clear university frameworks. These should define legitimate uses and areas where use is prohibited. Cheating and plagiarism violate good scientific practice and copyright laws. Cheating is difficult to detect. Plagiarism by AI is possible. Users of the products are responsible. LLMs are effective tools for generating exam questions. Nevertheless, careful review is necessary as even apparently high-quality products may contain errors. However, the risk of copyright infringement with AI-generated exam questions is low, as copyright law allows up to 15% of protected works to be used for teaching and exams. The grading of exam content is subject to higher education laws and regulations and the GDPR. Exclusively computer-based assessment without human review is not permitted. For high-risk applications in education, the EU's AI Regulation will apply in the future. When dealing with LLMs in assessments, evaluation criteria for existing assessments can be adapted, as can assessment programmes, e.g. to reduce the motivation to cheat. LLMs can also become the subject of the examination themselves. Teachers should undergo further training in AI and consider LLMs as an additional tool.