Abstract

The problem of designing NLP solvers for math word problems (MWPs) has seen sustained research activity and steady gains in test accuracy. Since existing solvers achieve high performance on the benchmark datasets for elementary level MWPs containing one-unknown arithmetic word problems, such problems are often considered “solved”, with the bulk of research attention moving to more complex MWPs. In this paper, we restrict our attention to English MWPs taught in grades four and lower. We provide strong evidence that the existing MWP solvers rely on shallow heuristics to achieve high performance on the benchmark datasets. To this end, we show that MWP solvers that do not have access to the question asked in the MWP can still solve a large fraction of MWPs. Similarly, models that treat MWPs as bag-of-words can also achieve surprisingly high accuracy. Further, we introduce a challenge dataset, SVAMP, created by applying carefully chosen variations over examples sampled from existing datasets. The best accuracy achieved by state-of-the-art models is substantially lower on SVAMP, thus showing that much remains to be done even for the simplest MWPs.
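The two probes mentioned in the abstract (hiding the question from the solver, and treating the problem as a bag of words) can be illustrated with a minimal sketch. This is not the authors' code: the example problem is made up, and the helper names `split_question` and `bag_of_words` are hypothetical. The sketch also assumes, for simplicity, that the question is the final sentence of the problem.

```python
import re
from collections import Counter
from typing import Tuple


def split_question(problem: str) -> Tuple[str, str]:
    """Split an MWP into (narrative, question).

    Assumption: the question is the last sentence of the problem text.
    """
    sentences = re.split(r"(?<=[.?!])\s+", problem.strip())
    return " ".join(sentences[:-1]), sentences[-1]


def bag_of_words(text: str) -> Counter:
    """Order-free view of a text: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


# Hypothetical example problem, not taken from any benchmark.
problem = "Jack had 8 pens. Mary gave him 5 more pens. How many pens does Jack have now?"

# Question-removed input: a solver in this setting would only see `narrative`.
narrative, question = split_question(problem)

# Two sentences with opposite meanings collapse to the same bag of words,
# so a model that only sees token counts cannot tell them apart.
same_bag = bag_of_words("Mary gave Jack 5 pens") == bag_of_words("Jack gave Mary 5 pens")
```

If a solver fed only `narrative`, or only the token counts, still predicts the right equation for many benchmark problems, that suggests the benchmark can be solved from surface cues rather than from understanding the question.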

Highlights

  • We provide strong evidence that the existing Math Word Problem (MWP) solvers rely on shallow heuristics to achieve high performance on the benchmark datasets

  • A Math Word Problem (MWP) consists of a short natural language narrative describing a state of the world and poses a question about some unknown quantities. The problems considered here are those where the output is a mathematical expression involving numbers and one or more arithmetic operators (+, −, ∗, /).

  • This indicates that the models can rely on superficial patterns present in the narrative of the MWP and achieve high accuracy without even looking at the question.

  • Recently, ASDiv (Miao et al., 2020) has been proposed to provide more diverse problems with annotations for equation, problem type and grade.
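To make the task format in the highlights concrete, here is a small sketch of an MWP record and its target output, an arithmetic expression over numbers using +, −, ∗ and /. The field names (`body`, `question`, `expression`) and the `evaluate` helper are assumptions made for illustration and do not reflect the schema of any particular dataset.

```python
import ast
import operator

# Only the four arithmetic operators named in the task definition are allowed.
OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expression: str):
    """Evaluate an expression made of numbers, parentheses and + - * / only."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")

    return walk(ast.parse(expression, mode="eval"))


# Hypothetical record: narrative, question, and the expression a solver should output.
mwp = {
    "body": "Jack had 8 pens. Mary gave him 5 more pens.",
    "question": "How many pens does Jack have now?",
    "expression": "8 + 5",
}

answer = evaluate(mwp["expression"])
```

A solver is typically judged on whether the expression it generates evaluates to the annotated answer, which is why the evaluator rejects anything other than numbers and the four operators.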


Summary

Introduction

Recently, ASDiv (Miao et al., 2020) has been proposed to provide more diverse problems with annotations for equation, problem type and grade. This indicates that the models can rely on superficial patterns present in the narrative of the MWP and achieve high accuracy without even looking at the question. The presence of these issues in existing benchmarks makes them unreliable for measuring the performance of models. We create a challenge set called SVAMP for more robust evaluation of methods developed to solve elementary level math word problems. Evaluating SOTA models on SVAMP, we find that they are not even able to solve half the problems in the dataset.

Artifacts in datasets have previously been identified for the Natural Language Inference (NLI) task. Challenge sets for NLP tasks have been proposed most notably for NLI and machine translation.

Related Work
Datasets and Methods
Analyzing the attention weights
SVAMP it appears to be of higher quality and harder than the MAWPS dataset
Protocol for creating variations
B Creation Protocol
Findings
E Ethical Considerations
