Abstract

We present ASDiv (Academia Sinica Diverse MWP Dataset), a diverse (in terms of both language patterns and problem types) English math word problem (MWP) corpus for evaluating the capability of various MWP solvers. Existing MWP corpora for studying AI progress remain limited either in language usage patterns or in problem types. We thus present a new English MWP corpus with 2,305 MWPs that cover more text patterns and most problem types taught in elementary school. Each MWP is annotated with its problem type and grade level (for indicating the level of difficulty). Furthermore, we propose a metric to measure the lexicon usage diversity of a given MWP corpus, and demonstrate that ASDiv is more diverse than existing corpora. Experiments show that our proposed corpus reflects the true capability of MWP solvers more faithfully.

Highlights

  • Human math/science tests have been considered more suitable for evaluating AI progress than the Turing test (Clark and Etzioni, 2016)

  • We present a math word problem (MWP) corpus that is highly diverse in terms of lexicon usage and covers most problem types taught in elementary school

  • Each MWP is annotated with the corresponding problem type, equation, and grade level, which are useful for machine learning and assessing the difficulty level of each MWP

Summary

Introduction

Human math/science tests have been considered more suitable for evaluating AI progress than the Turing test (Clark and Etzioni, 2016).

Solution (from an example MWP): 0.75 × 2 + 0.25 × 4 = 2.5

Existing MWP corpora are either limited in the diversity of their problem types and lexicon usage patterns, or lack information such as difficulty levels. Low-diversity corpora typically contain highly similar problems, which usually yields over-optimistic results (Huang et al., 2016), because the answer can frequently be obtained from the equation template associated with the most similar MWP in the training set.

Contributions:

(1) We construct a diverse (in terms of lexicon usage), wide-coverage (in problem type), and publicly available MWP corpus, annotated with problem type, equation, and grade level.

(2) We propose a lexicon usage diversity metric to measure the diversity of an MWP corpus and use it to evaluate existing corpora.

Experiments indicate that MWP solvers are still far behind human performance when evaluated on a corpus that mimics a real human test.

TVQ denotes an entity-state related variable (e.g., initial/current/final-state and change) whose value is updated sequentially according to a sequence of […]
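To make the idea of lexicon usage diversity concrete, here is a minimal sketch of one way such a score could be computed. It is not the metric proposed in the paper, whose definition is not given in this summary. As a stand-in it uses word-bigram Jaccard overlap, and the function names (`bigrams`, `jaccard`, `lexicon_diversity`) and the three sample problems are made up for illustration. Each problem is scored as 1 minus its highest overlap with any other problem, so near-duplicates score low.

```python
from typing import List, Set, Tuple


def bigrams(text: str) -> Set[Tuple[str, str]]:
    """Return the set of lowercase word bigrams in a problem text."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))


def jaccard(a: Set, b: Set) -> float:
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lexicon_diversity(problems: List[str]) -> List[float]:
    """Per-problem score: 1 minus the highest overlap with any other problem.

    Assumption: bigram Jaccard is only a stand-in similarity measure,
    not the measure used by the ASDiv authors.
    """
    grams = [bigrams(p) for p in problems]
    scores = []
    for i, g in enumerate(grams):
        best = max(
            (jaccard(g, h) for j, h in enumerate(grams) if j != i),
            default=0.0,
        )
        scores.append(1.0 - best)
    return scores


# Hypothetical mini-corpus: the first two problems share one template.
corpus = [
    "Tom has 3 apples and buys 2 more. How many apples does he have?",
    "Ann has 5 apples and buys 4 more. How many apples does she have?",
    "A rope 12 meters long is cut into 4 equal pieces. How long is each piece?",
]
scores = lexicon_diversity(corpus)
for text, score in zip(corpus, scores):
    print(f"{score:.2f}  {text}")
```

The two templated problems receive the same, lower score, while the unrelated third problem scores highest. A corpus dominated by low-scoring problems is the kind in which a solver can succeed by reusing the equation template of the most similar training problem.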

Problem Type
ASDiv Math Word Problem Corpus
Corpus Diversity Metrics
Corpus Construction
LD Distributions of Various Corpora
Experiments
Findings
Conclusion and Future Work