Exploring the Learning Difficulty of Data: Theory and Measure

Weiyao Zhu, Ou Wu, Yingjun Deng, Fengguang Su

doi:10.1145/3636512

Abstract

"Easy/hard sample" is common parlance in machine learning. The learning difficulty of a sample refers to how easy or hard the sample is to learn during a learning procedure. The growing need to measure learning difficulty, for example in difficulty-based weighting strategies, demonstrates its importance in machine learning. Previous literature has proposed a number of learning difficulty measures. However, no comprehensive investigation of learning difficulty is available to date; as a result, nearly all existing measures are heuristically defined and lack a rigorous theoretical foundation. This work presents a pilot theoretical study of the learning difficulty of samples. First, influential factors for learning difficulty are summarized. Under the various situations induced by these factors, correlations between learning difficulty and two vital criteria of machine learning, namely generalization error and model complexity, are revealed. Second, a theoretical definition of learning difficulty is proposed on the basis of these two criteria. Guided by this theoretical definition, a practical measure of learning difficulty is then proposed by drawing on bias-variance trade-off theory. Subsequently, the rationality of the theoretical definition is demonstrated by an analysis of several classical weighting methods, and that of the practical measure by extensive experiments covering all situations induced by the summarized influential factors. These weighting methods can be reasonably explained under the proposed theoretical definition and the associated propositions. Comparisons in these experiments indicate that the proposed measure significantly outperforms the other measures throughout.

Full Text