Arithmetic Domain Research Articles

As the usage of large language models for problems outside of simple text understanding or generation increases, assessing their abilities and limitations becomes crucial. While significant progress has been made in this area over the last few years, most research has focused on benchmarking in English, leaving other languages underexplored. This makes evaluating the reasoning and robustness level of language models in Ukrainian particularly challenging. The purpose of this work is to establish a comprehensive benchmark for evaluating the reasoning capabilities of large language models in the Ukrainian language. This paper presents the ZNO-Eval benchmark based on real exam tasks from Ukraine's standardized educational testing system: the External Independent Evaluation and the National Multi-subject Test. With single-answer, multiple-choice, matching, and open-ended questions from diverse subjects, including Ukrainian language, mathematics, history, and geography, this dataset paves the way toward a thorough analysis of reasoning capabilities across different domains and complexities. Evaluation of several well-known language models, such as GPT-3.5-Turbo, GPT-4o, GPT-4-Turbo, Mistral Large, Claude 3 Opus, and Gemini-1.5 Pro, on this benchmark demonstrated the superiority of GPT-4o in both common knowledge reasoning and intricate language tasks. At the same time, Gemini-1.5 Pro and GPT-4-Turbo excelled in the arithmetic domain, leading in single-answer and open-ended math problems. While all models were close to maximum performance in text-only common knowledge tasks like history and geography, a gap remains for Ukrainian language and math, highlighting the importance of developing specialized language benchmarks for more accurate assessments of model capabilities and limitations across different languages and contexts.
This research introduced ZNO-Eval, an effective benchmark for evaluating reasoning capabilities, and thoroughly explored the abilities and limitations of modern solutions in the Ukrainian language. Future research should aim to expand the scope of ZNO-Eval to other modalities, such as the images commonly used in exam problem descriptions.

Bacterial enumeration data are typically log transformed to realize a more normal distribution and stabilize the variance. Unfortunately, statistical results from log transformed data are often misinterpreted as data within the arithmetic domain. This study explores the implication of the slope and intercept from an unweighted linear regression and compares them with the results of the regression of log transformed data. The inference from the mathematical formulae is explained using a real dataset. For $y = Ax + B + \varepsilon$, where $y$ is the recovery (CFU/g) and $x$ is the target concentration (CFU/g), with error $\varepsilon$ homogeneous across $x$, the slope $A$ estimates the percent recovery $R$ when $B = 0$. In the regression of log transformed data, $\log y = \alpha \log x + \beta + \varepsilon_z$ (equivalent to $y = A x^{\alpha} \cdot \omega$), it is the intercept $\beta = \log(y/x) = \log A$ that estimates the percent recovery in logarithm when the slope $\alpha = 1$, which means that $R$ does not vary over $x$. The error term $\omega$ is multiplicative to $x$, while $\varepsilon_z$, or $\log \omega$, is additive to $\log x$. Whether the data should be transformed is not a matter of preference but a decision based on the distribution of the data. No significant difference was found between the five models (the linear regression of log transformed data, three generalized linear models, and a nonlinear model) in their predicted percent recovery when applied to our data. An acceptable regression model should yield residuals that are as close to normally distributed as possible. Statistical procedures that use log transformed data should be studied separately and documented as such, not collectively reported and interpreted together with results obtained in the arithmetic domain. The way statistical results are interpreted in the arithmetic domain does not apply to log transformed data.
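The distinction drawn in this abstract can be illustrated with a minimal sketch. It is not the authors' analysis: the data are synthetic, the true recovery of 0.8, the noise level, and the concentration range are assumptions chosen for illustration, and the error is generated as multiplicative (the log-domain assumption). The helper `ols` is a hypothetical name for a plain least-squares fit. The sketch fits the same data in both domains and shows where the recovery estimate appears in each.

```python
import math
import random


def ols(xs, ys):
    """Ordinary least squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic data (assumed, not from the paper): recovery R = 0.8 with a
# multiplicative lognormal error, targets spanning 10^2 to 10^6 CFU/g.
random.seed(0)
TRUE_RECOVERY = 0.8
targets = [10.0 ** e for e in (2, 3, 4, 5, 6) for _ in range(6)]
recovered = [TRUE_RECOVERY * x * 10 ** random.gauss(0, 0.1) for x in targets]

# Arithmetic domain: y = A*x + B. The slope A estimates the recovery,
# provided the intercept B is negligible.
A, B = ols(targets, recovered)

# Log domain: log y = alpha * log x + beta. When alpha is close to 1,
# the intercept beta estimates log(recovery), so 10**beta estimates R.
log_x = [math.log10(x) for x in targets]
log_y = [math.log10(y) for y in recovered]
alpha, beta = ols(log_x, log_y)
recovery_from_log = 10 ** beta

print(f"arithmetic domain: slope A = {A:.3f}, intercept B = {B:.1f}")
print(f"log domain: slope alpha = {alpha:.3f}, intercept beta = {beta:.3f}")
print(f"recovery implied by the log-domain intercept: {recovery_from_log:.3f}")
```

In this sketch the log-domain slope sits near 1 and carries no information about recovery, while the arithmetic-domain slope is the recovery estimate itself. Reading one domain's coefficients with the other domain's interpretation is the error the abstract warns against.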

Articles published on Arithmetic Domain

ZNO-Eval: Benchmarking reasoning capabilities of large language models in Ukrainian

Brain markers of subtraction and multiplication skills in childhood: task-based functional connectivity and individualized structural similarity.

Long-term outcome of pediatric head injuries – A five-year follow-up

DUAL SENSORY LOSS AND COGNITIVE TEST PERFORMANCE IN OLDER ADULTS IN INDIA

Higher level domain specific skills in mathematics; The relationship between algebra, geometry, executive function skills and mathematics achievement.

Counteracting the dynamic degradation of high-dimensional digital chaotic systems via a stochastic jump mechanism

Principal Component Analysis of Oxford Cognitive Screen in Patients With Stroke.

The Graph Structure of the Generalized Discrete Arnold's Cat Map

Modelling for Clinical and Physiological Evaluation of Diabetes and Glucose Homeostasis

Relationships between cognitive pattern recognition and specific mathematical domains in mathematics education

Realistic Mathematics Education Principles for Designing a Learning Sequence on Number Patterns

Developmental changes in size effects for simple tie and non-tie addition problems in 6- to 12-year-old children and adults

Interpretation and Implications of Lognormal Linear Regression Used for Bacterial Enumeration.

Logarithmic transformation and peak-discharge power-law analysis

Julian Huxley and the quantification of relative growth

Altered association between executive functions and reading and math fluency tasks in children with reading difficulties compared with typical readers.

Choice between helping vs not helping: level of mastery in the task

Relative Growth by the Elongated Jaws of Gars: A Perspective on Polyphasic Loglinear Allometry.

Quantifying the curvilinear metabolic scaling in mammals.

Autonomous mobile drilling mechanism with metamorphic function
