What Makes a High-Quality Training Dataset for Large Language Models: A Practitioners' Perspective

Abstract

Large Language Models (LLMs) have demonstrated remarkable performance across various application domains, largely due to their self-supervised pre-training on extensive, high-quality text datasets. However, despite the importance of constructing such datasets, many leading LLMs lack documentation of their dataset construction and training procedures, leaving LLM practitioners with a limited understanding of what makes a high-quality training dataset. To fill this gap, we first identified 18 characteristics of high-quality LLM training datasets, as well as 10 potential data pre-processing methods and 6 data quality assessment methods, through detailed interviews with 13 experienced LLM professionals. We then surveyed 219 LLM practitioners from 23 countries across 5 continents, asking respondents to rate the importance of these characteristics, provide a rationale for their ratings, specify the key data pre-processing and data quality assessment methods they used, and highlight the challenges encountered during these processes. From our analysis, we identified 13 crucial characteristics of high-quality LLM datasets that received high ratings, together with the key rationales respondents provided. We also identified several widely used data pre-processing and data quality assessment methods, along with 7 challenges encountered during these processes. Based on our findings, we discuss the implications for researchers and practitioners aiming to construct high-quality training datasets for optimizing LLMs.
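
The abstract does not enumerate the 10 pre-processing or 6 assessment methods, so the sketch below is purely illustrative: a minimal Python implementation of two steps practitioners commonly report, exact deduplication and heuristic quality filtering. All function names, thresholds, and the toy corpus are invented for the example.

    # Minimal sketch, assuming hash-based exact dedup and two heuristics;
    # real pipelines add near-dedup (e.g., MinHash), language ID, and more.
    import hashlib

    def dedup_exact(docs):
        """Drop byte-identical documents using a content hash."""
        seen, kept = set(), []
        for doc in docs:
            h = hashlib.sha256(doc.encode("utf-8")).hexdigest()
            if h not in seen:
                seen.add(h)
                kept.append(doc)
        return kept

    def heuristic_filter(doc, min_words=20, max_symbol_ratio=0.3):
        """Keep documents that are long enough and mostly alphanumeric."""
        words = doc.split()
        if len(words) < min_words:
            return False
        symbols = sum(1 for ch in doc if not (ch.isalnum() or ch.isspace()))
        return symbols / max(len(doc), 1) <= max_symbol_ratio

    corpus = [
        "Example document text repeated for illustration. " * 5,
        "Example document text repeated for illustration. " * 5,  # duplicate
        "too short",
    ]
    clean = [d for d in dedup_exact(corpus) if heuristic_filter(d)]
    print(len(corpus), "->", len(clean), "documents")  # 3 -> 1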

Similar Papers
  • Research Article
  • 10.62051/tx2dxj37
Data Annotation Methodologies for Fake News
  • Jul 10, 2025
  • Transactions on Computer Science and Intelligent Systems Research
  • Ruiyi Wang

With the development of technology, information dissemination has become faster and more convenient. Fake news has drawn much attention because it spreads rapidly, disguises itself well, and causes great harm. Because the performance of existing fake news detection models depends heavily on the quality of their training datasets, constructing high-quality, low-cost training datasets is crucial. This paper systematically reviews research progress on fake news dataset construction. First, it reviews the categories and definitions of fake news and summarizes the existing mainstream datasets for fake news detection. Second, for both traditional text news and newly emerging multimodal news, it analyzes the advantages and disadvantages of existing annotation technologies from three perspectives: traditional manual annotation, semi-automated annotation, and dynamic annotation. Finally, it proposes future research directions to address the problems of current datasets in dynamic annotation, multimodal fusion, and cross-domain generalization. High-quality datasets can effectively advance fake news detection technology to meet the challenges of an increasingly complex online information environment.

  • Research Article
  • Cited by 1
  • 10.3390/app15073720
Representing Aspectual Meaning in Sentence: Computational Modeling Based on Chinese
  • Mar 28, 2025
  • Applied Sciences
  • Hongchao Liu + 1 more

Situation types can be viewed as the foundation of sentence meaning representation. Noting that situation types cannot be determined by verbs alone, recent studies often predict situation types from the combination of different linguistic constituents at the sentence level rather than from lexically marked situation types. However, in languages with a fully marked aspectual system, such as Mandarin Chinese, this approach may miss the opportunity to leverage lexical aspect as well as other distribution-based lexical cues of event types. There is currently a lack of resources and methods for identifying and validating lexical aspect, and the issue is particularly severe for Chinese. From a computational linguistics perspective, this shortage stems mainly from the absence of a verified lexical aspect classification system and, consequently, of a gold-standard dataset annotated according to such a system. Moreover, without such a high-quality dataset, it remains unclear whether semantic models, including large general-purpose language models, can actually capture this important yet complex semantic information; as a result, lexical aspect analysis cannot yet be fully realized. To address these two problems, this paper sets out two objectives. First, we construct a high-quality lexical aspect dataset. Since the classification of lexical aspect depends on how it interacts with aspectual markers, we establish a principled classification and data construction process through the selection of vocabulary items, the compilation of co-occurrence frequency matrices, and hierarchical clustering. Second, based on the constructed dataset, we separately evaluate the ability of linguistic features and large language model word embeddings to identify lexical aspect categories, in order to (1) verify the capacity of semantic models to infer complex semantics and (2) achieve high-accuracy prediction of lexical aspects. Our final classification accuracy is 72.05%, the best result reported thus far.
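
A rough illustration of the co-occurrence-and-clustering step the authors describe: the sketch below builds a small verb-by-marker frequency matrix and hierarchically clusters the normalized row profiles. The verbs, counts, and marker set are invented stand-ins, not the paper's data.

    # Minimal sketch with scipy; columns are assumed co-occurrence counts
    # with three aspectual markers (e.g., -le, -zhe, -guo).
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    verbs = ["pao (run)", "si (die)", "zhidao (know)", "xie (write)"]
    cooc = np.array([
        [120, 80, 15],  # activity-like profile (invented)
        [ 90,  2,  5],  # achievement-like profile (invented)
        [ 10,  1,  3],  # state-like profile (invented)
        [100, 60, 40],  # accomplishment-like profile (invented)
    ], dtype=float)

    # Normalize rows so clustering reflects the marker distribution,
    # not raw verb frequency.
    profiles = cooc / cooc.sum(axis=1, keepdims=True)
    Z = linkage(pdist(profiles, metric="cosine"), method="average")
    labels = fcluster(Z, t=3, criterion="maxclust")
    for verb, lab in zip(verbs, labels):
        print(verb, "-> cluster", lab)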

  • Research Article
  • Cited by 1
  • 10.1371/journal.pone.0323535
A comparative analysis of large language models versus traditional information extraction methods for real-world evidence of patient symptomatology in acute and post-acute sequelae of SARS-CoV-2.
  • May 15, 2025
  • PloS one
  • Vedansh Thakkar + 8 more

Patient symptoms, crucial for disease progression and diagnosis, are often captured in unstructured clinical notes. Large language models (LLMs) offer potential advantages over traditional rule-based information extraction (IE) systems for extracting patient symptoms. This study compared fine-tuned LLMs (LLaMA2-13B and LLaMA3-8B) against BioMedICUS, a rule-based IE system, for extracting symptoms related to acute and post-acute sequelae of SARS-CoV-2 from clinical notes. The study utilized three corpora: UMN-COVID, UMN-PASC, and N3C-COVID. Prevalence, keyword, and fairness analyses were conducted to assess symptom distribution and model equity across demographics. BioMedICUS outperformed the fine-tuned LLMs in most cases. On the UMN-PASC dataset, BioMedICUS achieved a macro-averaged F1-score of 0.70 for positive mention detection, compared to 0.66 for LLaMA2-13B and 0.62 for LLaMA3-8B. On the N3C-COVID dataset, BioMedICUS scored 0.75 for positive mention detection, while LLaMA2-13B and LLaMA3-8B scored 0.53 and 0.68, respectively. However, the LLMs performed better in specific instances, such as detecting positive mentions of change in sleep in the UMN-PASC dataset, where LLaMA2-13B (0.79) and LLaMA3-8B (0.65) outperformed BioMedICUS (0.60). In the fairness analysis, BioMedICUS generally showed stronger performance across patient demographics. Keyword analysis using ANOVA on symptom distributions across all three corpora showed that both corpus (df = 2, p < 0.001) and symptom (df = 79, p < 0.001) have a statistically significant effect on log-transformed term frequency-inverse document frequency (TF-IDF) values, with corpus accounting for 52% of the variance in log TF-IDF values and symptom accounting for 35%. While BioMedICUS generally outperformed the LLMs, the latter showed promising results in specific areas, particularly LLaMA3-8B in identifying negative symptom mentions. However, both LLaMA models faced challenges in demographic fairness and generalizability. These findings underscore the need for diverse, high-quality training datasets and robust annotation processes to enhance LLMs' performance and reliability in clinical applications.
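
To make the keyword analysis concrete, here is a minimal sketch of a two-way ANOVA of log-transformed TF-IDF on corpus and symptom using statsmodels. Only the model formula mirrors the analysis described above; the DataFrame is random stand-in data, since the per-term values are not available here.

    # Sketch: two-way ANOVA on log TF-IDF with corpus and symptom factors.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "corpus": rng.choice(["UMN-COVID", "UMN-PASC", "N3C-COVID"], 300),
        "symptom": rng.choice([f"sym{i}" for i in range(10)], 300),
        "tfidf": rng.uniform(0.01, 1.0, 300),  # placeholder values
    })
    df["log_tfidf"] = np.log(df["tfidf"])

    model = ols("log_tfidf ~ C(corpus) + C(symptom)", data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))  # per-factor F tests, p-values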

  • Research Article
  • Cited by 1
  • 10.2196/73325
Assessing the Impact of the Quality of Textual Data on Feature Representation and Machine Learning Models: Quantitative Study Using Large Language Models
  • Dec 30, 2025
  • Journal of Medical Internet Research
  • Tabinda Sarwar + 2 more

Background: Data collected in controlled settings typically results in high-quality datasets. However, in real-world applications, the quality of data collection is often compromised. It is well established that the quality of a dataset significantly impacts the performance of machine learning models. In this context, detailed information about individuals is often recorded in progress notes. Given the critical nature of health applications, it is essential to evaluate the impact of textual data quality, as any incorrect prediction can have serious, potentially life-threatening consequences.

Objective: This study aims to quantify the quality of textual datasets and systematically evaluate the impact of varying levels of errors on feature representation and machine learning models. The primary goal is to determine whether feature representations and machine learning models are tolerant to errors and to assess whether investing additional time and computational resources to improve data quality is justified.

Methods: We developed a rudimentary error rate metric to evaluate textual dataset quality at the token level. The Mixtral large language model (LLM) was used to quantify and correct errors in low-quality datasets. The study analyzed two health care datasets: the high-quality MIMIC-III public hospital dataset (for mortality prediction) and a lower-quality private dataset from Australian aged care homes (AACHs; for depression and fall risk prediction). Errors were systematically introduced into MIMIC-III at varying rates, while the AACH dataset quality was improved using the LLM. Feature representations and machine learning models were assessed using the area under the receiver operating characteristic curve.

Results: For the sampled 35,774 and 6336 patients from the MIMIC and AACH datasets, respectively, we used Mixtral to introduce errors in MIMIC and correct errors in AACH. Mixtral correctly detected errors in 63% of progress notes, with 17% containing a single token misclassified due to medical terminology. LLMs demonstrated potential for improving progress note quality by addressing various errors. Under varying error rates (5%-20%, in 5% increments), feature representation performance was tolerant to lower error rates (<10%) but declined significantly at higher rates. This aligned with the AACH dataset's 8% error rate, where no major performance drop was observed. Across both datasets, term frequency-inverse document frequency outperformed embedding features, and machine learning models varied in effectiveness, highlighting that optimal feature representation and model choice depend on the specific task.

Conclusions: This study revealed that models performed relatively well on datasets with lower error rates (<10%), but their performance declined significantly as error rates increased (≥10%). Therefore, it is crucial to evaluate the quality of a dataset before using it for machine learning tasks. For datasets with higher error rates, implementing corrective measures is essential to ensure the reliability and effectiveness of machine learning models.
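
The paper's "rudimentary error rate metric" and error-injection procedure are only described as token-level, so the following is a hedged sketch of the general idea: an error rate computed as the fraction of flagged tokens, plus synthetic typo injection swept over the 5%-20% rates the study uses. The corruption rule and the checker interface are assumptions.

    # Sketch: token-level error rate and adjacent-character-swap injection.
    import random

    def error_rate(tokens, is_error):
        """Fraction of tokens flagged erroneous by a checker `is_error`."""
        if not tokens:
            return 0.0
        return sum(is_error(t) for t in tokens) / len(tokens)

    def inject_errors(tokens, rate, seed=0):
        """Corrupt roughly `rate` of tokens by swapping two adjacent chars."""
        rng = random.Random(seed)
        out = []
        for t in tokens:
            if len(t) > 2 and rng.random() < rate:
                i = rng.randrange(len(t) - 1)
                t = t[:i] + t[i + 1] + t[i] + t[i + 2:]
            out.append(t)
        return out

    note = "patient reports shortness of breath and mild chest pain".split()
    vocab = set(note)  # toy checker: anything out of vocabulary is an error
    for rate in (0.05, 0.10, 0.15, 0.20):  # the study's 5% increments
        noisy = inject_errors(note, rate, seed=1)
        print(rate, round(error_rate(noisy, lambda t: t not in vocab), 2))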

  • Research Article
  • 10.4258/hir.2025.31.2.166
Advancing Korean Medical Large Language Models: Automated Pipeline for Korean Medical Preference Dataset Construction
  • Apr 1, 2025
  • Healthcare Informatics Research
  • Jean Seo + 5 more

Objectives: Developing large language models (LLMs) in biomedicine requires access to high-quality training and alignment tuning datasets. However, publicly available Korean medical preference datasets are scarce, hindering the advancement of Korean medical LLMs. This study constructs the Korean Medical Preference Dataset (KoMeP), an alignment tuning dataset built with an automated pipeline that minimizes the high costs of human annotation, and evaluates its efficacy.

Methods: KoMeP was generated using the DAHL score, an automated hallucination evaluation metric. Five LLMs (Dolly-v2-3B, MPT-7B, GPT-4o, Qwen-2-7B, Llama-3-8B) produced responses to 8,573 biomedical examination questions, from which 5,551 preference pairs were extracted. Each pair consisted of a "chosen" response and a "rejected" response, as determined by their DAHL scores. The dataset was evaluated by training five different models with each of two alignment tuning methods: direct preference optimization (DPO) and odds ratio preference optimization (ORPO). The KorMedMCQA benchmark was employed to assess the effectiveness of alignment tuning.

Results: Models trained with DPO consistently improved KorMedMCQA performance; notably, Llama-3.1-8B showed a 43.96% increase. In contrast, ORPO training produced inconsistent results. Additionally, English-to-Korean transfer learning proved effective, particularly for English-centric models like Gemma-2, whereas Korean-to-English transfer learning achieved limited success. Instruction tuning with KoMeP yielded mixed outcomes, which suggests challenges in dataset formatting.

Conclusions: KoMeP is the first publicly available Korean medical preference dataset and significantly improves alignment tuning performance in LLMs. The DPO method outperforms ORPO in alignment tuning. Future work should focus on expanding KoMeP, developing a Korean-native dataset, and refining alignment tuning methods to produce safer and more reliable Korean medical LLMs.
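
A minimal sketch of the pair-extraction step, assuming a generic hallucination scorer in place of the DAHL metric (whose implementation is not given here): for each question, the best-scoring response becomes "chosen" and the worst "rejected", in the prompt/chosen/rejected layout commonly used for DPO-style training.

    # Sketch: turn scored multi-model responses into preference pairs.
    def build_preference_pairs(responses_by_question, score):
        """responses_by_question: {question: [response, ...]};
        score: callable where higher means less hallucinated (assumed)."""
        pairs = []
        for question, responses in responses_by_question.items():
            if len(responses) < 2:
                continue
            ranked = sorted(responses, key=score, reverse=True)
            pairs.append({
                "prompt": question,
                "chosen": ranked[0],
                "rejected": ranked[-1],
            })
        return pairs

    # Toy usage with a placeholder scorer (length, purely illustrative).
    data = {"What does aspirin inhibit?":
            ["Cyclooxygenase (COX) enzymes.", "It cures colds."]}
    print(build_preference_pairs(data, score=len))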

  • Research Article
  • Cited by 1
  • 10.1016/j.cmpb.2025.109033
Improving large language models for miRNA information extraction via prompt engineering.
  • Nov 1, 2025
  • Computer methods and programs in biomedicine
  • Rongrong Wu + 9 more

  • Research Article
  • Cited by 13
  • 10.1186/s13321-024-00928-8
Suitability of large language models for extraction of high-quality chemical reaction dataset from patent literature.
  • Nov 26, 2024
  • Journal of cheminformatics
  • Sarveswara Rao Vangala + 8 more

With the advent of artificial intelligence (AI), it is now possible to design diverse and novel molecules from previously unexplored chemical space. However, a challenge for chemists is the synthesis of such molecules. Recently, there have been attempts to develop AI models for retrosynthesis prediction, which rely on the availability of a high-quality training dataset. In this work, we explore the suitability of large language models (LLMs) for extracting high-quality chemical reaction data from patent documents. A comparative study on the same set of patents from an earlier study showed that the proposed automated approach can enhance current datasets by adding 26% new reactions. Several challenges were identified during reaction mining, and for some of them alternative solutions were proposed. A detailed analysis was also performed in which several wrong entries were identified in the previously curated dataset. Reactions extracted using the proposed pipeline over a larger patent dataset can improve the accuracy and efficiency of synthesis prediction models in the future.

Scientific contribution: In this work we evaluated the suitability of large language models for mining a high-quality chemical reaction dataset from patent literature. We showed that the proposed approach can significantly improve the quantity of the reaction database by identifying more chemical reactions, and its quality by correcting previous errors/false positives.

  • Research Article
  • Cited by 1
  • 10.1609/aaai.v39i23.34593
Key-Point-Driven Data Synthesis with Its Enhancement on Mathematical Reasoning
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Yiming Huang + 6 more

Large language models have shown great potential in complex reasoning tasks, yet their performance is often hampered by the scarcity of high-quality, reasoning-focused training datasets. Addressing this challenge, we propose Key-Point-Driven Data Synthesis (KPDDS), a novel data synthesis framework that synthesizes question-answer pairs by leveraging key points and exemplar practices from authentic data sources. KPDDS ensures the generation of novel questions with rigorous quality control and substantial scalability. As a result, we present KPMath, an extensive synthetic dataset tailored for mathematical reasoning, comprising over 800K question-answer pairs. Utilizing KPMath and augmenting it with additional reasoning-intensive corpora, we create the comprehensive KPMath-Plus dataset. Our experiments demonstrate that this dataset can enhance the mathematical reasoning performance of models across various architectures and sizes. The Qwen1.5-72B model, fine-tuned on KPMath-Plus, achieves 87.0% accuracy on GSM8K and 58.3% on MATH, surpassing competitors in the 7B to 72B range and the best commercial models such as GPT-4 across multiple math reasoning datasets.
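
KPDDS's actual prompts and quality-control loop are not reproduced in the abstract, so the sketch below only illustrates the general shape of key-point-driven synthesis: sample key points and an exemplar from authentic data, then ask a generator model (a stub here) for a new question-answer pair. All template text is invented.

    # Sketch: assemble a key-point-driven synthesis prompt.
    import random

    def build_synthesis_prompt(key_points, exemplar):
        points = "\n".join(f"- {p}" for p in key_points)
        return (
            "You are a math problem writer.\n"
            f"Required key points:\n{points}\n"
            f"Exemplar problem for style:\n{exemplar}\n"
            "Write one NEW problem exercising all key points, then solve it."
        )

    def generate_pair(llm, key_point_pool, exemplars, k=2, seed=0):
        rng = random.Random(seed)
        prompt = build_synthesis_prompt(rng.sample(key_point_pool, k),
                                        rng.choice(exemplars))
        return llm(prompt)  # `llm` stands in for any completion API

    print(build_synthesis_prompt(["linear equations", "unit conversion"],
                                 "A train travels 120 km in 1.5 hours..."))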

  • Research Article
  • 10.14778/3750601.3750630
GalaxyWeaver: Autonomous Table-to-Graph Conversion and Schema Optimization with Large Language Models
  • Aug 1, 2025
  • Proceedings of the VLDB Endowment
  • Bing Tong + 5 more

Most enterprise graph data derives from relational databases, yet transforming relational tables into query-optimized graph schemas remains challenging. Existing approaches have notable limitations: (1) transformations based on primary and foreign keys often fail to generate schemas optimized for query performance; (2) manual schema design, although flexible, is costly and requires domain expertise; and (3) machine learning methods predict graph structures based on data patterns but heavily depend on large, high-quality training datasets. To address these challenges, we propose GalaxyWeaver, a framework to automate query-aware graph schema generation. GalaxyWeaver utilizes the reasoning power of Large Language Models (LLMs) to align graph schema designs with specific query requirements, effectively integrating domain knowledge with optimization strategies. The framework employs prompt-guided analysis to enhance the decision-making accuracy of LLM agents, facilitating iterative schema refinement. Experiments across diverse domains show that GalaxyWeaver simplifies transformation while improving query performance and reducing storage costs.

  • Research Article
  • Cited by 3
  • 10.1609/aaai.v39i23.34602
Importance Weighting Can Help Large Language Models Self-Improve
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Chunyang Jiang + 4 more

Large language models (LLMs) have shown remarkable capability in numerous tasks and applications. However, fine-tuning LLMs on high-quality datasets under external supervision remains prohibitively expensive. In response, LLM self-improvement approaches have been developed vigorously in recent years. The typical paradigm of LLM self-improvement involves training the LLM on self-generated data, part of which may be detrimental and should be filtered out due to unstable data quality. While current works primarily employ filtering strategies based on answer correctness, in this paper we demonstrate that filtering out samples that are correct but have a high distribution shift extent (DSE) can also benefit self-improvement. Given that the actual sample distribution is usually inaccessible, we propose a new metric called DS weight to approximate DSE, inspired by importance weighting methods. Consequently, we integrate DS weight with self-consistency to comprehensively filter the self-generated samples and fine-tune the language model. Experiments show that with only a tiny valid set (up to 5% of the training set size) to compute DS weight, our approach can notably promote the reasoning ability of current LLM self-improvement methods. The resulting performance is on par with methods that rely on external supervision from pre-trained reward models.
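
The DS weight formula is not given in the abstract, so the sketch below substitutes a classic importance-weighting estimate: a domain classifier trained to separate the tiny valid set from self-generated samples yields density-ratio weights, and the lowest-weight (highest-shift) samples are filtered out. The features, sizes, and cutoff are assumptions.

    # Sketch: density-ratio importance weights via a domain classifier.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def importance_weights(self_feats, valid_feats):
        """Estimate w(x) ~ p_valid(x) / p_self(x)."""
        X = np.vstack([self_feats, valid_feats])
        y = np.r_[np.zeros(len(self_feats)), np.ones(len(valid_feats))]
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        p = clf.predict_proba(self_feats)[:, 1]
        return p / np.clip(1.0 - p, 1e-6, None)

    rng = np.random.default_rng(0)
    self_gen = rng.normal(0.0, 1.0, size=(500, 8))  # self-generated samples
    valid = rng.normal(0.3, 1.0, size=(25, 8))      # tiny valid set (~5%)
    w = importance_weights(self_gen, valid)
    keep = w >= np.quantile(w, 0.2)  # drop the 20% with the highest shift
    print(keep.sum(), "of", len(self_gen), "samples kept")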

  • Research Article
  • Cited by 18
  • 10.1097/icu.0000000000001091
Foundation models in ophthalmology: opportunities and challenges.
  • Nov 4, 2024
  • Current opinion in ophthalmology
  • Mertcan Sevgi + 4 more

Last year marked the development of the first foundation model in ophthalmology, RETFound, setting the stage for generalizable medical artificial intelligence (GMAI) that can adapt to novel tasks. Additionally, rapid advancements in large language model (LLM) technology, including models such as GPT-4 and Gemini, have been tailored for medical specialization and evaluated on clinical scenarios with promising results. This review explores the opportunities and challenges for further advancements in these technologies. RETFound outperforms traditional deep learning models in specific tasks, even when only fine-tuned on small datasets. Additionally, LMMs like Med-Gemini and Medprompt GPT-4 perform better than out-of-the-box models for ophthalmology tasks. However, there is still a significant deficiency in ophthalmology-specific multimodal models. This gap is primarily due to the substantial computational resources required to train these models and the limitations of high-quality ophthalmology datasets. Overall, foundation models in ophthalmology present promising opportunities but face challenges, particularly the need for high-quality, standardized datasets for training and specialization. Although development has primarily focused on large language and vision models, the greatest opportunities lie in advancing large multimodal models, which can more closely mimic the capabilities of clinicians.

  • Research Article
  • 10.3390/app152412961
Multiple Large AI Models’ Consensus for Object Detection—A Survey
  • Dec 9, 2025
  • Applied Sciences
  • Marcin Iwanowski + 1 more

The rapid development of large artificial intelligence (AI) models, including large language models (LLMs), multimodal large language models (MLLMs), and vision-language models (VLMs), has enabled instruction-driven visual understanding, where a single foundation model can recognize and localize arbitrary objects from natural-language prompts. However, predictions from individual models remain inconsistent: LLMs hallucinate nonexistent entities, while VLMs exhibit limited recall and unstable calibration compared to purpose-trained detectors. To address these limitations, a new paradigm termed "multiple large AI models' consensus" has emerged. In this approach, multiple heterogeneous LLMs, MLLMs, or VLMs process a shared visual-textual instruction and generate independent structured outputs (bounding boxes and categories). Their results are then merged through consensus mechanisms. This cooperative inference improves spatial accuracy and semantic correctness, making it particularly suitable for generating high-quality training datasets for fast real-time object detectors. This survey provides a comprehensive overview of multiple large AI models' consensus for object detection. We formalize the concept, review related literature on ensemble reasoning and multimodal perception, and categorize existing methods into four frameworks: prompt-level, reasoning-to-detection, box-level, and hybrid consensus. We further analyze fusion algorithms, evaluation metrics, and benchmark datasets, highlighting their strengths and limitations. Finally, we discuss open challenges in vocabulary alignment, uncertainty calibration, computational efficiency, and bias propagation, and identify emerging trends such as consensus-aware training, structured reasoning, and collaborative perception ecosystems.
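
As one concrete instance of the box-level family surveyed above, here is a hedged sketch of a simple consensus: boxes from several models are grouped greedily by label and IoU, clusters lacking a majority of model votes are dropped, and surviving boxes are coordinate-averaged. Real fusion algorithms (e.g., weighted box fusion) are more careful than this.

    # Sketch: majority-vote box-level consensus across model predictions.
    def iou(a, b):
        ax1, ay1, ax2, ay2 = a
        bx1, by1, bx2, by2 = b
        iw = max(0, min(ax2, bx2) - max(ax1, bx1))
        ih = max(0, min(ay2, by2) - max(ay1, by1))
        inter = iw * ih
        union = (ax2-ax1)*(ay2-ay1) + (bx2-bx1)*(by2-by1) - inter
        return inter / union if union else 0.0

    def consensus(predictions, iou_thr=0.5, min_votes=2):
        """predictions: one list per model of (label, box) tuples."""
        clusters = []
        for model_id, preds in enumerate(predictions):
            for label, box in preds:
                for c in clusters:
                    if c["label"] == label and iou(c["boxes"][0], box) >= iou_thr:
                        c["boxes"].append(box)
                        c["models"].add(model_id)
                        break
                else:
                    clusters.append({"label": label, "boxes": [box],
                                     "models": {model_id}})
        merged = []
        for c in clusters:
            if len(c["models"]) >= min_votes:  # require model agreement
                avg = tuple(sum(v) / len(c["boxes"]) for v in zip(*c["boxes"]))
                merged.append((c["label"], avg))
        return merged

    preds = [[("cat", (10, 10, 50, 50))], [("cat", (12, 8, 52, 49))],
             [("dog", (100, 100, 140, 150))]]
    print(consensus(preds))  # keeps the 2-vote cat box, drops the 1-vote dog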

  • Research Article
  • Cited by 1
  • 10.1609/aaai.v39i2.32119
SS-GEN: A Social Story Generation Framework with Large Language Models
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Yi Feng + 7 more

Children with Autism Spectrum Disorder (ASD) often misunderstand social situations and struggle to participate in daily routines. Social Stories™ are traditionally crafted by psychology experts under strict constraints to address these challenges, but they are costly to produce and limited in diversity. As Large Language Models (LLMs) advance, there is an opportunity to develop more automated, affordable, and accessible methods to generate Social Stories in real time with broad coverage. However, adapting LLMs to meet the unique and strict constraints of Social Stories is challenging. To this end, we propose SS-GEN, a Social Story GENeration framework with LLMs. First, we develop a sophisticated constraint-driven strategy named StarSow to hierarchically prompt LLMs to generate Social Stories at scale, followed by rigorous human filtering to build a high-quality dataset. Additionally, we introduce quality assessment criteria to evaluate the effectiveness of the generated stories. Considering that powerful closed-source large models require very complex instructions and expensive API fees, we finally fine-tune smaller language models on our curated high-quality dataset, achieving comparable results at lower cost with simpler instructions and deployment. This work marks a significant step in leveraging AI to personalize Social Stories cost-effectively for autistic children at scale, and we hope it encourages future research on special groups.

  • Research Article
  • Cited by 5
  • 10.1016/j.cose.2024.103814
XLMR4MD: New Vietnamese dataset and framework for detecting the consistency of description and permission in Android applications using large language models
  • Mar 15, 2024
  • Computers & Security
  • Qui Ngoc Nguyen + 2 more

  • Preprint Article
  • 10.2196/preprints.75103
Evaluating and Improving Syndrome Differentiation Thinking Ability in Large Language Models: Methods Study (Preprint)
  • Mar 28, 2025
  • Chunliang Chen + 6 more

Background: Large language models (LLMs) provide new opportunities to advance the intelligent development of Traditional Chinese medicine (TCM). Syndrome differentiation thinking is an essential part of TCM, and equipping LLMs with this capability represents a crucial step toward more effective clinical applications of TCM. However, given the complexity of TCM syndrome differentiation thinking, acquiring this ability is a considerable challenge for a model.

Objective: This study aims to evaluate LLMs' syndrome differentiation thinking ability and to design a method that effectively enhances their performance in this area.

Methods: We decompose the process of TCM syndrome differentiation thinking into three core tasks: pathogenesis inference, syndrome inference, and diagnostic suggestion. To evaluate the performance of LLMs on these tasks, we constructed a high-quality evaluation dataset, providing a reliable foundation for the quantitative assessment of their capabilities. Furthermore, we developed a methodology for generating instruction data based on the idea of an "open-book exam": we customized three data templates and dynamically retrieved task-relevant professional knowledge, inserting it at predefined positions within the templates. This approach effectively generates high-quality instruction data that aligns with the unique characteristics of TCM syndrome differentiation thinking. Leveraging this instruction data, we fine-tuned the base model, enhancing the syndrome differentiation thinking ability of the LLMs.

Results: We collected 200 medical cases for the evaluation dataset and standardized them into three types of task questions. We tested general and TCM LLMs, comparing their performance with our proposed solution. The results demonstrate that our method significantly enhances LLMs' syndrome differentiation thinking ability. Our model achieved 85.7% and 81.2% accuracy on Tasks 1 and 2, respectively, surpassing the best-performing TCM and general LLMs by 26.3% and 15.8%. On Task 3, our model scored 84.3, indicating that its output is very similar to the advice given by experts.

Conclusions: Existing general LLMs and TCM LLMs still have significant limitations in the core task of syndrome differentiation thinking. Our research shows that fine-tuning LLMs by designing professional instruction templates and generating high-quality instruction data can significantly improve their performance on core tasks. The optimized LLMs show a high degree of similarity in reasoning results with the opinions of domain experts, indicating that they can simulate syndrome differentiation thinking to a certain extent. This has important theoretical and practical significance for in-depth interpretation of the complexity of the clinical diagnosis and treatment process of TCM.
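
To illustrate the "open-book exam" idea, the sketch below fills one invented instruction template with knowledge snippets retrieved for a case. The paper's three actual templates, its retriever, and its knowledge base are not reproduced here; everything below is an assumption-labeled toy.

    # Sketch: retrieve-then-fill instruction construction.
    PATHOGENESIS_TEMPLATE = (
        "Reference knowledge:\n{knowledge}\n\n"
        "Medical case:\n{case}\n\n"
        "Task: infer the pathogenesis step by step, citing the reference."
    )

    def retrieve(case, knowledge_base, top_k=2):
        """Toy retriever: rank snippets by word overlap with the case."""
        case_words = set(case.lower().split())
        scored = sorted(knowledge_base,
                        key=lambda s: len(case_words & set(s.lower().split())),
                        reverse=True)
        return scored[:top_k]

    def build_instruction(case, knowledge_base):
        snippets = retrieve(case, knowledge_base)
        return PATHOGENESIS_TEMPLATE.format(knowledge="\n".join(snippets),
                                            case=case)

    kb = ["Liver qi stagnation often presents with hypochondriac pain.",
          "Spleen deficiency manifests as fatigue and poor appetite."]
    print(build_instruction("Patient reports fatigue and poor appetite.", kb))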
