FineMath: A Fine-Grained Mathematical Evaluation Benchmark for Chinese Large Language Models

Abstract

To thoroughly assess the mathematical reasoning abilities of Large Language Models (LLMs), we need to carefully curate evaluation datasets covering diverse mathematical concepts and problems at different difficulty levels. In pursuit of this objective, we propose FineMath, a fine-grained mathematical evaluation benchmark for assessing Chinese LLMs. FineMath covers the major key mathematical concepts taught in elementary school math, which are further divided into 17 categories of math word problems, enabling in-depth analysis of the mathematical reasoning abilities of LLMs. All 17 categories of math word problems are manually annotated with difficulty levels according to the number of reasoning steps required to solve them. We conduct extensive experiments on a wide range of LLMs on FineMath and find that there is still considerable room for improvement in the mathematical reasoning capabilities of Chinese LLMs. We also carry out an in-depth analysis of the evaluation process and methods, which have previously been overlooked; these two factors significantly influence model results and our understanding of their mathematical reasoning capabilities. Our data is available at https://github.com/tjunlp-lab/FineMATH.
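
The abstract does not include code, but the setup it describes maps naturally onto a simple scoring loop. Below is a minimal sketch that reports accuracy broken down by concept category and by difficulty level (number of reasoning steps); the field names and exact-match rule are illustrative assumptions, not FineMath's released schema.

```python
from collections import defaultdict

def evaluate(problems, model_answer):
    """problems: iterable of dicts with 'question', 'answer',
    'category' (one of the 17 concept categories), and 'steps'
    (the annotated number of reasoning steps).
    model_answer: callable mapping a question string to an answer string."""
    stats = defaultdict(lambda: [0, 0])  # key -> [correct, total]
    for p in problems:
        pred = model_answer(p["question"])
        correct = pred.strip() == p["answer"].strip()
        # tally accuracy per category and per difficulty level
        for key in (p["category"], f"steps={p['steps']}"):
            stats[key][0] += int(correct)
            stats[key][1] += 1
    return {k: c / t for k, (c, t) in stats.items()}
```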

Similar Papers
  • Research Article
  • 10.1038/s41597-025-05283-3
MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data
  • Aug 8, 2025
  • Scientific Data
  • Meng Fang + 4 more

Large language models (LLMs) have significantly advanced natural language understanding and demonstrated strong problem-solving abilities. Despite these successes, most LLMs still struggle with solving mathematical problems due to the intricate reasoning required. To support rigorous evaluation of mathematical reasoning in LLMs, we introduce the “MathOdyssey” dataset: a curated collection of 387 expert-generated mathematical problems spanning high school, university, and Olympiad-level topics. Each problem is accompanied by a detailed solution and categorized by difficulty level, subject area, and answer type. The dataset was developed through a rigorous multi-stage process involving contributions from subject experts, peer review, and standardized formatting. We provide detailed metadata and a standardized schema to facilitate consistent use in downstream applications. To demonstrate the dataset’s utility, we evaluate several representative LLMs and report their performance across problem types. We release MathOdyssey as an open-access resource to enable reproducible and fine-grained assessment of mathematical capabilities in LLMs and to foster further research in mathematical reasoning and education.
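
The abstract mentions a standardized schema with a solution, difficulty level, subject area, and answer type per problem. A plausible record layout, sketched as a Python dataclass, is shown below; the field names are assumptions inferred from the metadata described, not the dataset's published schema.

```python
from dataclasses import dataclass

@dataclass
class MathProblem:
    problem_id: str
    statement: str
    solution: str      # detailed worked solution
    difficulty: str    # e.g. "high school", "university", "Olympiad"
    subject_area: str  # e.g. "algebra", "number theory"
    answer_type: str   # e.g. "numeric", "expression", "proof"
```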

  • Research Article
  • 10.1038/s41698-025-00916-7
Evaluating the performance of large language & visual-language models in cervical cytology screening
  • May 23, 2025
  • npj Precision Oncology
  • Qi Hong + 15 more

Large language models (LLMs) and large visual-language models (LVLMs) have exhibited near-human levels of knowledge, image comprehension, and reasoning abilities, and their performance has undergone evaluation in some healthcare domains. However, a systematic evaluation of their capabilities in cervical cytology screening has yet to be conducted. Here, we constructed CCBench, a benchmark dataset dedicated to the evaluation of LLMs and LVLMs in cervical cytology screening, and developed a GPT-based semi-automatic evaluation pipeline to assess the performance of six LLMs (GPT-4, Bard, Claude-2.0, LLaMa-2, Qwen-Max, and ERNIE-Bot-4.0) and five LVLMs (GPT-4V, Gemini, LLaVA, Qwen-VL, and ViLT) on this dataset. CCBench comprises 773 question-answer (QA) pairs and 420 visual-question-answer (VQA) triplets, making it the first dataset in cervical cytology to include both QA and VQA data. We found that LLMs and LVLMs demonstrate promising accuracy and specialization in cervical cytology screening. GPT-4 achieved the best performance on the QA dataset, with an accuracy of 70.5% for close-ended questions and an average expert evaluation score of 6.9/10 for open-ended questions. On the VQA dataset, Gemini achieved the highest accuracy for close-ended questions at 67.8%, while GPT-4V attained the highest expert evaluation score of 6.1/10 for open-ended questions. In addition, LLMs and LVLMs showed varying abilities in answering questions across different topics and difficulty levels. However, their performance remains inferior to the expertise exhibited by cytopathology professionals, and the risk of generating misinformation could lead to potential harm. Therefore, substantial improvements are required before these models can be reliably deployed in clinical practice.
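
The "GPT-based semi-automatic evaluation pipeline" presumably scores open-ended answers with an LLM judge that humans then spot-check. A minimal sketch of such a grader is below; the prompt wording and the `ask_judge` callable are placeholder assumptions, not CCBench's actual pipeline.

```python
JUDGE_PROMPT = (
    "Reference answer:\n{reference}\n\nModel answer:\n{candidate}\n\n"
    "Rate the model answer from 0 to 10 for factual agreement with the "
    "reference. Reply with a single integer."
)

def grade_open_ended(reference: str, candidate: str, ask_judge) -> int:
    """ask_judge: callable that sends a prompt to the judge model
    and returns its text reply."""
    reply = ask_judge(JUDGE_PROMPT.format(reference=reference,
                                          candidate=candidate))
    # clamp the parsed score to the 0-10 range used in the study
    return max(0, min(10, int(reply.strip().split()[0])))
```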

  • Research Article
  • 10.1145/3732784
TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs
  • Apr 29, 2025
  • ACM Transactions on Intelligent Systems and Technology
  • Shuyi Xie + 15 more

Large language models (LLMs) have shown impressive capabilities across various natural language tasks. However, evaluating their alignment with human preferences remains a challenge. To this end, we propose a comprehensive human evaluation framework to assess LLMs’ proficiency in following instructions on diverse real-world tasks. We construct a hierarchical task tree encompassing 7 major areas, over 200 categories, and over 800 tasks, covering diverse capabilities such as question answering, reasoning, multi-turn dialogue, and text generation, to evaluate LLMs in a comprehensive and in-depth manner. We also design detailed evaluation standards and processes to facilitate consistent, unbiased judgments from human evaluators. A test set of over 3,000 instances is released, spanning different difficulty levels and knowledge domains. Our work provides a standardized methodology to evaluate human alignment in LLMs for both English and Chinese. We also analyze the feasibility of automating parts of the evaluation with a strong LLM (GPT-4). Our framework supports a thorough assessment of LLMs as they are integrated into real-world applications. We have made publicly available the task tree, the TencentLLMEval dataset, and the evaluation methodology, which have proven effective in assessing the performance of Tencent Hunyuan LLMs. By doing so, we aim to facilitate the benchmarking of advances in the development of safe and human-aligned LLMs.
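
One straightforward way to represent the hierarchical task tree the abstract describes (major areas containing categories containing leaf tasks) is a recursive node type. The structure and example labels below are illustrative, not the released taxonomy.

```python
from dataclasses import dataclass, field

@dataclass
class TaskNode:
    name: str
    children: list["TaskNode"] = field(default_factory=list)

    def leaf_tasks(self):
        """Yield the leaf-level tasks under this node."""
        if not self.children:
            yield self.name
        for child in self.children:
            yield from child.leaf_tasks()

tree = TaskNode("root", [
    TaskNode("reasoning", [TaskNode("math word problems"),
                           TaskNode("logical deduction")]),
    TaskNode("text generation", [TaskNode("summarization")]),
])
print(list(tree.leaf_tasks()))
# ['math word problems', 'logical deduction', 'summarization']
```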

  • Research Article
  • 10.1609/aaai.v39i22.34585
RMath: A Logic Reasoning-Focused Datasets Toward Mathematical Multistep Reasoning Tasks
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Ziyi Hu + 5 more

Mathematical reasoning ability objectively reflects a language model's understanding of implicit knowledge in context, with logic being a prerequisite for exploring, articulating, and establishing effective reasoning. Large language models (LLMs) have shown great potential in complex reasoning tasks, of which mathematical reasoning is a representative example. However, existing mathematical datasets either focus on commonsense reasoning, assessing a model's knowledge application ability, or on arithmetic problems with fixed calculation rules, evaluating a model's rapid learning capability. Datasets that require solving problems solely through logical reasoning are lacking, which makes it difficult to assess how accurately LLMs understand the implicit logical relationships in problems and derive conclusions based solely on the given conditions. To address this challenge, we construct a dataset specifically for multi-step reasoning tasks: Reasoning-Math (RMath). This dataset focuses on evaluating logical reasoning abilities through mathematical reasoning problems, covering typical problem types including direct reasoning, hypothetical reasoning, and nested reasoning problems. Additionally, we design a standardized annotation scheme that transforms natural-language descriptions of conditions into formal propositions. Other annotation contents include problem categories, proposition truth values, and proposition relationship types. This not only reduces biases caused by semantic misunderstandings during problem-solving, but also facilitates the incorporation of theoretically grounded logical reasoning methods to enhance reasoning abilities. Furthermore, we propose a normalization problem-solving framework based on propositional logic for RMath and design a prompt-tuning problem-solving process that guides LLMs to absorb mathematical logical theories and improve their reasoning abilities. Finally, we evaluate several popular LLMs on RMath and present the corresponding results.
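
As a concrete illustration of the annotation scheme the abstract describes, a single annotated record might pair each natural-language condition with a formal proposition, a truth value, and typed relations to other propositions. The field names and relation vocabulary below are assumptions for illustration, not RMath's published format.

```python
annotation = {
    "problem_type": "hypothetical reasoning",
    "propositions": {
        "P1": {"text": "Alice tells the truth", "truth": None},
        "P2": {"text": "Bob tells the truth", "truth": None},
    },
    "relations": [
        # "If Alice tells the truth, then Bob lies": P1 -> not P2
        {"type": "implication", "antecedent": "P1",
         "consequent": ("not", "P2")},
    ],
}
```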

  • Abstract
  • 10.1182/blood-2024-208513
Evaluating the Accuracy of Artificial Intelligence (AI)-Generated Synopses for Plasma Cell Disorder Treatment Regimens
  • Nov 5, 2024
  • Blood
  • Aleenah Mohsin + 7 more

  • Research Article
  • Cited by 1
  • 10.7759/cureus.81871
Evaluating the Accuracy and Reliability of Large Language Models (ChatGPT, Claude, DeepSeek, Gemini, Grok, and Le Chat) in Answering Item-Analyzed Multiple-Choice Questions on Blood Physiology.
  • Apr 8, 2025
  • Cureus
  • Mayank Agarwal + 2 more

Background Previous research has highlighted the potential of large language models (LLMs) in answering multiple-choice questions (MCQs) in medical physiology. However, their accuracy and reliability in specialized fields, such as blood physiology, remain underexplored. This study evaluates the performance of six free-to-use LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok, and Le Chat) in solving item-analyzed MCQs on blood physiology. The findings aim to assess their suitability as educational aids. Methods This cross-sectional study at the All India Institute of Medical Sciences, Raebareli, India, involved administering a 40-item MCQ test on blood physiology to 75 first-year medical students. Item analysis utilized the Difficulty Index (DIF I), Discrimination Index (DI), and Distractor Effectiveness (DE). Internal consistency was assessed with the Kuder-Richardson 20 (KR-20) coefficient. These 40 item-analyzed MCQs were presented to six selected LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok, and Le Chat) available as standalone Android applications on March 19, 2025. Three independent users accessed each LLM simultaneously, uploading the compiled MCQs in a Portable Document Format (PDF) file. Accuracy was determined as the percentage of correct responses averaged across all three users. Reliability was measured as the percentage of MCQs answered correctly by an LLM consistently for all three users. Descriptive statistics were presented as mean ± standard deviation and percentages. Pearson's correlation coefficient or Spearman's rho was used to evaluate the associations between variables, with p < 0.05 considered significant. Results Item analysis confirmed the validity and reliability of the assessment tool, with a DIF I of 63.2 ± 20.4, a DI of 0.38 ± 0.20, a DE of 66.7 ± 33.3, and a KR-20 of 0.804. Among LLMs, Claude 3.7 demonstrated the highest reliable accuracy (95%), followed by DeepSeek (93%), Grok 3 beta (93%), ChatGPT (90%), Gemini 2.0 (88%), and Mistral Le Chat (70%). No significant correlations were found between LLM performance and MCQ difficulty, discrimination power, or distractor effectiveness. Conclusions The MCQ assessment tool exhibited an appropriate difficulty level, strong discriminatory power, and adequately constructed distractors. LLMs, particularly Claude, DeepSeek, and Grok, demonstrated high accuracy and reliability in solving blood physiology MCQs, supporting their role as supplementary educational tools. LLMs handled questions of varying difficulty, discrimination power, and distractor effectiveness with similar competence. However, given occasional errors, they should be used alongside traditional teaching methods and expert supervision.
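
The item-analysis metrics the study reports have standard textbook definitions: the Difficulty Index is the percentage of students answering an item correctly, the Discrimination Index contrasts the top and bottom 27% of scorers, and KR-20 is (k/(k-1))(1 - Σpq/σ²). The sketch below computes them from a students-by-items 0/1 response matrix; it follows the textbook formulas, not the authors' code.

```python
import numpy as np

def item_analysis(responses: np.ndarray):
    """responses: (n_students, n_items) matrix of 0/1 answers."""
    n_students, n_items = responses.shape
    totals = responses.sum(axis=1)
    order = np.argsort(totals)
    g = max(1, int(round(0.27 * n_students)))  # top/bottom 27% groups
    lower, upper = responses[order[:g]], responses[order[-g:]]

    dif = responses.mean(axis=0) * 100             # Difficulty Index (%)
    di = upper.mean(axis=0) - lower.mean(axis=0)   # Discrimination Index

    p = responses.mean(axis=0)                     # proportion correct
    q = 1 - p
    var_total = totals.var(ddof=1)                 # variance of total scores
    kr20 = (n_items / (n_items - 1)) * (1 - (p * q).sum() / var_total)
    return dif, di, kr20
```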

  • Research Article
  • 10.1609/aaai.v39i24.34749
S^3cMath: Spontaneous Step-Level Self-Correction Makes Large Language Models Better Mathematical Reasoners
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Yuchen Yan + 7 more

Self-correction is a novel method that can stimulate the potential reasoning abilities of large language models (LLMs). It involves detecting and correcting errors during the inference process as LLMs solve reasoning problems. However, recent works do not treat self-correction as a spontaneous, intrinsic capability of LLMs. Instead, such correction is achieved through post-hoc generation, external knowledge introduction, multi-model collaboration, and similar techniques. In this paper, we propose a series of mathematical LLMs called S^3cMath, which are able to perform Spontaneous Step-level Self-correction for Mathematical reasoning. This capability helps LLMs recognize whether their ongoing inference tends to contain errors and simultaneously correct these errors to produce a more reliable response. We propose a method that employs step-level sampling to construct step-wise self-correction data for achieving this ability, and we implement a training strategy that uses the constructed data to equip LLMs with spontaneous step-level self-correction capacities. Our data and methods have been demonstrated to be effective across various foundation LLMs, consistently showing significant progress in evaluations on GSM8K, MATH, and other mathematical benchmarks. To the best of our knowledge, we are the first to introduce the spontaneous step-level self-correction ability of LLMs in mathematical reasoning.
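
A minimal sketch of step-level sampling for building self-correction data, following the idea in the abstract: sample several candidate next steps, check each with a verifier, and record (erroneous step, corrected step) pairs. `sample_step`, `step_is_valid`, and the `problem.solved` check are hypothetical stand-ins, not the paper's components.

```python
def build_correction_data(problem, sample_step, step_is_valid,
                          n_candidates=4):
    steps, corrections = [], []
    while not problem.solved(steps):
        candidates = [sample_step(problem, steps)
                      for _ in range(n_candidates)]
        good = [c for c in candidates if step_is_valid(problem, steps, c)]
        if not good:
            break  # no valid continuation found; abandon this trace
        for bad in candidates:
            if bad not in good:
                # pair the erroneous step with a verified correction
                corrections.append((steps[:], bad, good[0]))
        steps.append(good[0])
    return steps, corrections
```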

  • Research Article
  • 10.1609/aaai.v39i24.34760
Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Junyi Ye + 4 more

The mathematical capabilities of AI systems are complex and multifaceted. Most existing research has predominantly focused on the correctness of AI-generated solutions to mathematical problems. In this work, we argue that beyond producing correct answers, AI systems should also be capable of, or assist humans in, developing novel solutions to mathematical challenges. This study explores the creative potential of Large Language Models (LLMs) in mathematical reasoning, an aspect that has received limited attention in prior research. We introduce a novel framework and benchmark, CreativeMath, which encompasses problems ranging from middle school curricula to Olympiad-level competitions, designed to assess LLMs' ability to propose innovative solutions after some known solutions have been provided. Our experiments demonstrate that, while LLMs perform well on standard mathematical tasks, their capacity for creative problem-solving varies considerably. Notably, the Gemini-1.5-Pro model outperformed other LLMs in generating novel solutions. This research opens a new frontier in evaluating AI creativity, shedding light on both the strengths and limitations of LLMs in fostering mathematical innovation, and setting the stage for future developments in AI-assisted mathematical discovery.

  • Research Article
  • 10.1609/aaai.v39i23.34645
VCR: A “Cone of Experience” Driven Synthetic Data Generation Framework for Mathematical Reasoning
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Sannyuya Liu + 5 more

Large language models (LLMs) have shown excellent performance in natural language processing but struggle with mathematical reasoning. As the training mode gradually solidifies, researchers have proposed a data-centric concept of artificial intelligence, emphasizing the development of higher-quality data to empower LLMs. Existing studies construct synthetic data for mathematical reasoning by expanding public datasets, thereby performing supervised fine-tuning of LLMs. However, these methods mostly focus on quantity while neglecting quality: challenging samples fail to receive adequate consideration during the data synthesis process, resulting in high construction costs, low quality density, and serious data homogenization. This paper proposes a multi-agent environment called Virtual ClassRoom (VCR), which leverages various LLM-driven agents to construct high-quality, diversified synthetic data. Inspired by the "Cone of Experience" educational theory, VCR introduces three experience levels (direct, iconic, and symbolic) into the data synthesis process by analogy with human learning. A user-friendly instruction set and role-playing system are carefully designed, enabling VCR to autonomously plan the scale of synthetic data. This system covers various educational scenarios, including lecture, discussion, problem design, and problem-solving. The AdaBoost idea embodied in the global iterative process further promotes steady performance improvement. Extensive experiments show that the synthetic data generated by VCR possess higher quality density and generalization capability, giving LLMs superior mathematical reasoning performance at the same data scale.

  • Research Article
  • Cited by 1
  • 10.1093/jamia/ocaf023
Application of unified health large language model evaluation framework to In-Basket message replies: bridging qualitative and quantitative assessments.
  • Mar 10, 2025
  • Journal of the American Medical Informatics Association : JAMIA
  • Chuan Hong + 13 more

Large language models (LLMs) are increasingly utilized in healthcare, transforming medical practice through advanced language processing capabilities. However, the evaluation of LLMs predominantly relies on human qualitative assessment, which is time-consuming, resource-intensive, and may be subject to variability and bias. There is a pressing need for quantitative metrics to enable scalable, objective, and efficient evaluation. We propose a unified evaluation framework that bridges qualitative and quantitative methods to assess LLM performance in healthcare settings. This framework maps evaluation aspects, such as linguistic quality, efficiency, content integrity, trustworthiness, and usefulness, to both qualitative assessments and quantitative metrics. We apply our approach to empirically evaluate the Epic In-Basket feature, which uses an LLM to generate patient message replies. The empirical evaluation demonstrates that while Artificial Intelligence (AI)-generated replies exhibit high fluency, clarity, and minimal toxicity, they face challenges with coherence and completeness. Clinicians' manual decision to use AI-generated drafts correlates strongly with quantitative metrics, suggesting that quantitative metrics have the potential to reduce human effort in the evaluation process and make it more scalable. Our study highlights the potential of a unified evaluation framework that integrates qualitative and quantitative methods, enabling scalable and systematic assessments of LLMs in healthcare. Automated metrics streamline evaluation and monitoring processes, but their effective use depends on alignment with human judgment, particularly for aspects requiring contextual interpretation. As LLM applications expand, refining evaluation strategies and fostering interdisciplinary collaboration will be critical to maintaining high standards of accuracy, ethics, and regulatory compliance. Our unified evaluation framework bridges the gap between qualitative human assessments and automated quantitative metrics, enhancing the reliability and scalability of LLM evaluations in healthcare. While automated quantitative evaluations are not yet ready to fully replace qualitative human evaluations, they can be used to enhance the process. With relevant benchmarks derived from the unified framework proposed here, they can also be applied to LLM monitoring and to the evaluation of updated versions of technology originally evaluated against qualitative human standards.
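
The reported correlation between clinicians' binary accept/reject decisions and continuous automated scores can be checked with a point-biserial correlation. The sketch below shows the idea on toy data; the arrays and the metric name are illustrative, not the study's data.

```python
import numpy as np
from scipy.stats import pointbiserialr  # binary vs. continuous correlation

used_draft = np.array([1, 0, 1, 1, 0, 1])              # clinician used the draft?
coherence = np.array([0.9, 0.4, 0.8, 0.7, 0.3, 0.85])  # automated metric score
r, p = pointbiserialr(used_draft, coherence)
print(f"point-biserial r={r:.2f}, p={p:.3f}")
```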

  • Conference Article
  • 10.54941/ahfe1006669
Enhancing Thematic Analysis with Local LLMs: A Scientific Evaluation of Prompt Engineering Techniques
  • Jan 1, 2025
  • Timothy Meyer + 2 more

Thematic Analysis (TA) is a powerful tool for human factors, HCI, and UX researchers to gather system usability insights from qualitative data like open-ended survey questions. However, TA is both time-consuming and difficult, requiring researchers to review and compare hundreds, thousands, or even millions of pieces of text. Recently, this has driven many to explore using Large Language Models (LLMs) to support such an analysis. However, LLMs have their own processing limitations and usability challenges when implementing them reliably as part of a research process, especially when working with a large corpus of data that exceeds LLM context windows. These challenges are compounded when using locally hosted LLMs, which may be necessary to analyze sensitive and/or proprietary data. However, little human factors research has rigorously examined how various prompt engineering techniques can augment an LLM to overcome these limitations and improve usability. Accordingly, in the present paper, we investigate the impact of several prompt engineering techniques on the quality of LLM-mediated TA. Using a local LLM (Llama 3.1 8b) to ensure data privacy, we developed four LLM variants with progressively complex prompt engineering techniques and used them to extract themes from user feedback regarding the usability of a novel knowledge management system prototype. The LLM variants were as follows:

1. A “baseline” variant with no prompt engineering or scalability
2. A “naïve batch processing” variant that sequentially analyzed small batches of the user feedback to generate a single list of themes
3. An “advanced batch processing” variant that built upon the naïve variant by adding prompt engineering techniques (e.g., chain-of-thought prompting)
4. A “cognition-inspired” variant that incorporated advanced prompt engineering techniques and kept a working memory-like log of themes and their frequency

Contrary to conventional approaches to studying LLMs, which largely rely upon descriptive statistics (e.g., % improvement), we systematically applied a set of evaluation methods from behavioral science and human factors. We performed three stages of evaluation of the outputs of each LLM variant: we compared the LLM outputs to our team’s original TA, we had human factors professionals (N = 4) rate the quality and usefulness of the outputs, and we compared the Inter-Rater Reliability (IRR) of other human factors professionals (N = 2) attempting to code the original data with the outputs generated by each variant. Results demonstrate that even small, locally deployed LLMs can produce high-quality TA when guided by appropriate prompts. While the “baseline” variant performed surprisingly well for small datasets, we found that the other, scalable methods depended upon advanced prompt engineering techniques to be successful. Only our novel “cognition-inspired” approach performed as well as the “baseline” variant in qualitative and quantitative comparisons of ratings and coding IRR. This research provides practical guidance for human factors researchers looking to integrate LLMs into their qualitative analysis workflows, disentangling and uncovering the importance of context window limitations, batch processing strategies, and advanced prompt engineering techniques. The findings suggest that local LLMs can serve as valuable and scalable tools in thematic analysis.
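
A rough sketch of the "cognition-inspired" variant as the abstract outlines it: process the corpus in small batches and carry a running, working-memory-like log of themes and their frequencies between batches. The prompt wording and the `ask_llm` callable are placeholder assumptions, not the authors' implementation.

```python
from collections import Counter

def thematic_analysis(comments, ask_llm, batch_size=20):
    theme_log = Counter()  # working-memory log: theme -> frequency
    for i in range(0, len(comments), batch_size):
        batch = comments[i:i + batch_size]
        prompt = (
            f"Known themes so far (with counts): {dict(theme_log)}\n\n"
            "Assign each comment below to an existing theme or a new one. "
            "Reply with one theme name per line.\n\n"
            + "\n".join(batch)
        )
        for theme in ask_llm(prompt).splitlines():
            if theme.strip():
                theme_log[theme.strip()] += 1
    return theme_log
```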

  • Research Article
  • 10.1609/aaai.v39i23.34640
Augmenting Math Word Problems via Iterative Question Composing
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Haoxiong Liu + 3 more

Despite the advancements in large language models (LLMs) for mathematical reasoning, solving competition-level math problems remains a significant challenge, especially for open-source LLMs without external tools. We introduce the MMIQC dataset, comprising a mixture of processed web data and synthetic question-response pairs, aimed at enhancing the mathematical reasoning capabilities of base language models. Models fine-tuned on MMIQC consistently surpass their counterparts in performance on the MATH benchmark across various model sizes. Notably, Qwen-72B-MMIQC achieves 45.0% accuracy, exceeding the previous open-source state of the art by 8.2% and outperforming the initial version of GPT-4 released in 2023. Extensive evaluation results on Hungarian high school finals suggest that this improvement generalizes to unseen data. Our ablation study on MMIQC reveals that a large part of the improvement can be attributed to our novel augmentation method, Iterative Question Composing (IQC), which involves iteratively composing new questions from seed problems using an LLM and applying rejection sampling through another LLM.
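
A minimal sketch of Iterative Question Composing as the abstract describes it: one LLM composes new questions from seed problems, another answers them, and rejection sampling keeps only the pairs whose answers check out, feeding accepted questions into the next round. `compose`, `solve`, and `verify` are hypothetical stand-ins for the two LLMs and the answer check.

```python
def iterative_question_composing(seeds, compose, solve, verify, rounds=3):
    pool = list(seeds)
    dataset = []
    for _ in range(rounds):
        new_pool = []
        for q in pool:
            new_q = compose(q)          # LLM 1: compose a question variant
            answer = solve(new_q)       # LLM 2: attempt a solution
            if verify(new_q, answer):   # rejection sampling step
                dataset.append((new_q, answer))
                new_pool.append(new_q)  # feed back into the next round
        pool = new_pool
    return dataset
```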

  • Research Article
  • Cited by 6
  • 10.3390/vehicles7010011
Large Language Models (LLMs) as Traffic Control Systems at Urban Intersections: A New Paradigm
  • Jan 27, 2025
  • Vehicles
  • Sari Masri + 2 more

This study introduces a novel approach to traffic control systems that uses Large Language Models (LLMs) as traffic controllers. The study utilizes their logical reasoning, scene understanding, and decision-making capabilities to optimize throughput and provide feedback based on traffic conditions in real time. LLMs centralize traditionally disconnected traffic control processes and can integrate traffic data from diverse sources to provide context-aware decisions. LLMs can also deliver tailored outputs through various means, such as wireless signals and visuals, to drivers, infrastructure, and autonomous vehicles. To evaluate LLMs’ ability as traffic controllers, this study proposed a four-stage methodology comprising data creation and environment initialization, prompt engineering, conflict identification, and fine-tuning. We simulated multi-lane four-leg intersection scenarios and generated detailed datasets to enable conflict detection using LLMs, with a Python simulation as ground truth. We used chain-of-thought prompts to lead LLMs in understanding the context, detecting conflicts, resolving them using traffic rules, and delivering context-sensitive traffic management solutions. We evaluated the performance of GPT-4o-mini, Gemini, and Llama as traffic controllers. Results showed that the fine-tuned GPT-4o-mini achieved 83% accuracy and an F1-score of 0.84, and exhibited a promising performance in generating actionable traffic management insights, with high ROUGE-L scores across conflict identification (0.95), decision making (0.91), priority assignment (0.94), and waiting time optimization (0.92). This methodology confirmed LLMs’ benefits as traffic controllers in real-world applications. We demonstrated that LLMs can offer precise recommendations to drivers in real time, including yielding, slowing, or stopping based on vehicle dynamics. This study demonstrates LLMs’ transformative potential for traffic control, enhancing efficiency and safety at intersections.
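
An illustrative chain-of-thought prompt of the kind the study describes for intersection conflict detection; the wording and step breakdown are assumptions for illustration, not the paper's actual prompt.

```python
COT_PROMPT = """You are a traffic controller at a four-leg intersection.
Vehicle states (lane, heading, speed):
{vehicle_states}

Step 1: List every pair of vehicles whose intended paths cross.
Step 2: For each conflicting pair, apply right-of-way rules.
Step 3: Output an action (yield / slow / stop / proceed) per vehicle."""

def make_prompt(vehicle_states: str) -> str:
    # fill the placeholder with the current simulation snapshot
    return COT_PROMPT.format(vehicle_states=vehicle_states)
```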

  • Research Article
  • Cited by 4
  • 10.1609/aaai.v37i13.26879
Exploring Social Biases of Large Language Models in a College Artificial Intelligence Course
  • Jun 26, 2023
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Skylar Kolisko + 1 more

Large neural network-based language models play an increasingly important role in contemporary AI. Although these models demonstrate sophisticated text generation capabilities, they have also been shown to reproduce harmful social biases contained in their training data. This paper presents a project that guides students through an exploration of social biases in large language models. As a final project for an intermediate college course in Artificial Intelligence, students developed a bias probe task for a previously-unstudied aspect of sociolinguistic or sociocultural bias they were interested in exploring. Through the process of constructing a dataset and evaluation metric to measure bias, students mastered key technical concepts, including how to run contemporary neural networks for natural language processing tasks; construct datasets and evaluation metrics; and analyze experimental results. Students reported their findings in an in-class presentation and a final report, recounting patterns of predictions that surprised, unsettled, and sparked interest in advocating for technology that reflects a more diverse set of backgrounds and experiences. Through this project, students engage with and even contribute to a growing body of scholarly work on social biases in large language models.

  • Research Article
  • 10.1038/s42005-025-01956-y
Quantum many-body physics calculations with large language models
  • Jan 31, 2025
  • Communications Physics
  • Haining Pan + 7 more

Large language models (LLMs) have demonstrated abilities to perform complex tasks in multiple domains, including mathematical and scientific reasoning. We demonstrate that, with carefully designed prompts, LLMs can accurately carry out key calculations from research papers in theoretical physics. We focus on a broadly used approximation method in quantum physics, the Hartree-Fock method, which requires an analytic multi-step calculation to derive an approximate Hamiltonian and the corresponding self-consistency equations. To carry out the calculations using LLMs, we design multi-step prompt templates that break down the analytic calculation into standardized steps with placeholders for problem-specific information. We evaluate GPT-4’s performance in executing the calculation for 15 papers from the past decade, demonstrating that, with correction of intermediate steps, it can correctly derive the final Hartree-Fock Hamiltonian in 13 cases. Aggregating across all research papers, we find an average score of 87.5 (out of 100) on the execution of individual calculation steps. We further use LLMs to mitigate the two primary bottlenecks in this evaluation process: (i) extracting information from papers to fill in templates and (ii) automatic scoring of the calculation steps, demonstrating good results in both cases.
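
The "multi-step prompt templates with placeholders" pattern can be sketched as a list of step strings filled with problem-specific information extracted from each paper. The step wording and placeholder names below are illustrative assumptions, not the paper's released templates.

```python
TEMPLATE_STEPS = [
    "Write down the interacting Hamiltonian for {system} in "
    "second-quantized form using operators {operators}.",
    "Apply the Hartree-Fock mean-field decoupling to the interaction "
    "term, keeping expectation values {order_parameters}.",
    "Derive the self-consistency equations for {order_parameters}.",
]

def fill_template(system, operators, order_parameters):
    """Fill every step's placeholders with paper-specific details."""
    return [s.format(system=system, operators=operators,
                     order_parameters=order_parameters)
            for s in TEMPLATE_STEPS]
```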
