Chinese Comma Disambiguation in Math Word Problems Using SMOTE and Random Forests

Jingxiu Huang, Linjing Wu, Qingtang Liu, Yunxiang Zheng

doi:10.3390/ai2040044

Open Access


Abstract

Natural language understanding technologies play an essential role in automatically solving math word problems. In machine understanding of Chinese math word problems, comma disambiguation, which poses a class-imbalanced binary learning problem, is a valuable instrument for transforming the problem statement into a structured representation. To address this problem, we applied the synthetic minority oversampling technique (SMOTE) and random forests to comma classification after jointly optimizing their hyperparameters. We propose a strict measure to evaluate the performance of deployed comma classification models on comma disambiguation in math word problems. To verify the effectiveness of random forest classifiers with SMOTE on comma disambiguation, we conducted two-stage experiments on two datasets with a collection of evaluation measures. Experimental results showed that random forest classifiers were significantly superior to baseline methods in Chinese comma disambiguation. Running SMOTE with hyperparameter settings optimized for the class distribution of each dataset is preferable to running it with its default values. For practitioners, we suggest that the hyperparameters of a classification model be optimized again after the parameter settings of SMOTE have been changed.
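To make the oversampling step concrete, the following is a minimal sketch of the SMOTE idea in plain Python: each synthetic minority sample is a random point on the line segment between a minority sample and one of its k nearest minority neighbours. The function names `smote` and `balance`, the default `k=5`, and the choice to oversample until both classes have equal size are assumptions made for illustration; this is not the authors' implementation, and the random forest stage is not shown.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    minority: list of equal-length feature vectors (lists of floats).
    k: number of nearest minority neighbours considered per base sample.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the other minority samples, nearest to the base first.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: sq_dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Random point on the segment between the base and the neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance(X, y, minority_label=1, k=5, seed=0):
    """Oversample the minority class until both classes have equal size.

    Equal class sizes are an illustrative assumption; the target ratio is
    itself a hyperparameter that can be tuned per dataset.
    """
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_major = len(y) - len(minority)
    new = smote(minority, max(0, n_major - len(minority)), k=k, seed=seed)
    return X + new, y + [minority_label] * len(new)
```

In this sketch, `k` and the target class ratio play the role of the SMOTE hyperparameters mentioned above. Because they change the training distribution the classifier sees, it is plausible that classifier settings tuned under one SMOTE configuration are no longer the best under another, which is consistent with the recommendation to optimize the classifier again after changing SMOTE's settings.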

Highlights

Solving math word problems (MWPs), a task that dates back to the 1960s [1], has gained intensive attention from international scholars [2,3].
We show that the combination of random forest classifiers and the algorithm
This table gives a general view of the performance of optimized random forest (RF) classifiers on the evaluation metrics, including True Positive Rate (TPR), True Negative Rate (TNR), Weighted Accuracy (WA), geometric mean (GM), Matthews Correlation Coefficient (MCC), and area under the curve (AUC).
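As a reading aid for the metrics named above, here is a minimal sketch that computes TPR, TNR, WA, GM, and MCC from the four cells of a binary confusion matrix, using their common textbook definitions. The function name `imbalance_metrics` is hypothetical, and the WA weight of 0.5 is an assumption, since the weighting used in the paper is not stated in this excerpt. AUC is omitted because it requires ranked classifier scores rather than hard predictions.

```python
import math


def imbalance_metrics(tp, fn, tn, fp, weight=0.5):
    """Compute imbalance-aware metrics from confusion-matrix counts.

    weight: share given to TPR in weighted accuracy (assumed 0.5 here).
    """
    tpr = tp / (tp + fn) if tp + fn else 0.0   # recall on the positive class
    tnr = tn / (tn + fp) if tn + fp else 0.0   # recall on the negative class
    wa = weight * tpr + (1 - weight) * tnr     # weighted accuracy
    gm = math.sqrt(tpr * tnr)                  # geometric mean of TPR and TNR
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"TPR": tpr, "TNR": tnr, "WA": wa, "GM": gm, "MCC": mcc}
```

These measures matter under class imbalance because plain accuracy can look good for a useless model: a classifier that labels all of 10 positive and 100 negative commas as negative is about 91% accurate, yet `imbalance_metrics(0, 10, 100, 0)` gives TPR, GM, and MCC all equal to 0.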

Summary

Introduction

Solving math word problems (MWPs), a task that dates back to the 1960s [1], has gained intensive attention from international scholars [2,3]. In a math word problem solver, natural language understanding technologies are adapted to automatically transform MWPs into a structured representation that facilitates quantitative reasoning. The formalized representation is essentially derived from a list of textual propositions within a sentence or several sentence fragments. By analogy with Rhetorical Structure Theory [4], textual propositions in a math word problem can be defined as quantitative discourse units (QDUs). Commas serve as an important cue for detecting QDUs in the problem statement of MWPs. In general, MWPs can be split into several QDUs by periods. Sentence fragments surrounding commas express essential relations for understanding the meaning of the sentence [6].
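To illustrate how comma disambiguation can be framed as a classification task, the sketch below enumerates every comma in a Chinese problem statement together with the text fragments on its left and right. Each such pair is a candidate instance that a classifier would then label as a QDU boundary or not. The function name `comma_instances`, the set of sentence-ending marks, and the sample problem are all made up for illustration and are not taken from the paper's datasets or feature design.

```python
import re

SENTENCE_END = "。？！"   # full-width period, question mark, exclamation mark
COMMA = "，"              # full-width Chinese comma


def comma_instances(problem):
    """Return (left_fragment, right_fragment) pairs, one per comma."""
    instances = []
    # Split into sentences, keeping each closing mark with its sentence.
    for sentence in re.findall(r"[^。？！]+[。？！]?", problem):
        fragments = sentence.rstrip(SENTENCE_END).split(COMMA)
        # Adjacent fragments share exactly one comma between them.
        for left, right in zip(fragments, fragments[1:]):
            instances.append((left, right))
    return instances


# Hypothetical problem: "Xiao Ming has 5 apples, Xiao Hong has 3 apples.
# How many apples do they have in total?"
pairs = comma_instances("小明有5个苹果，小红有3个苹果。他们一共有多少个苹果？")
```

The sketch only produces candidates; deciding which commas actually separate QDUs is the disambiguation problem that the classifier addresses.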

Methods

Results

Conclusion