Abstract

Recent studies suggest that the minimum error entropy (MEE) criterion can outperform the traditional mean square error criterion in supervised machine learning, especially in nonlinear and non-Gaussian situations. In practice, however, one has to estimate the error entropy from the samples since in general the analytical evaluation of error entropy is not possible. By the Parzen windowing approach, the estimated error entropy converges asymptotically to the entropy of the error plus an independent random variable whose probability density function (PDF) corresponds to the kernel function in the Parzen method. This quantity of entropy is called the smoothed error entropy, and the corresponding optimality criterion is named the smoothed MEE (SMEE) criterion. In this paper, we study theoretically the SMEE criterion in supervised machine learning where the learning machine is assumed to be nonparametric and universal. Some basic properties are presented. In particular, we show that when the smoothing factor is very small, the smoothed error entropy equals approximately the true error entropy plus a scaled version of the Fisher information of error. We also investigate how the smoothing factor affects the optimal solution. In some special situations, the optimal solution under the SMEE criterion does not change with increasing smoothing factor. In general cases, when the smoothing factor tends to infinity, minimizing the smoothed error entropy will be approximately equivalent to minimizing error variance, regardless of the conditional PDF and the kernel.
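To make the central quantity concrete, the short Python sketch below (illustrative code, not from the paper; the function name parzen_entropy and all numerical values are our own choices) computes the standard Parzen-window plug-in estimate of the error entropy with a Gaussian kernel. As the number of samples grows, this estimate approaches the smoothed error entropy described above, i.e., the entropy of the error plus an independent variable distributed as the kernel; for Gaussian errors that smoothed entropy has a closed form, shown for comparison.

```python
import numpy as np

def parzen_entropy(errors, bandwidth):
    """Plug-in (resubstitution) estimate of the error entropy using a
    Gaussian Parzen window with smoothing factor `bandwidth`.
    For large sample sizes it approaches the entropy of e + bandwidth*eps,
    eps ~ N(0, 1), i.e. the smoothed error entropy."""
    e = np.asarray(errors, dtype=float)
    diff = e[:, None] - e[None, :]                     # pairwise error differences
    kern = np.exp(-0.5 * (diff / bandwidth) ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    pdf_at_samples = kern.mean(axis=1)                 # Parzen density at each sample
    return -np.log(pdf_at_samples).mean()              # estimate of -E[log p(e)]

# Gaussian errors with unit variance: true entropy 0.5*log(2*pi*e) ~ 1.419.
rng = np.random.default_rng(0)
e = rng.normal(size=2000)
for lam in (0.05, 0.5, 1.0):
    # Exact smoothed entropy here is the entropy of N(0, 1 + lam^2).
    exact = 0.5 * np.log(2 * np.pi * np.e * (1 + lam ** 2))
    print(f"lambda={lam}:  Parzen estimate={parzen_entropy(e, lam):.3f}"
          f"   smoothed entropy={exact:.3f}")
```

Note that for a fixed kernel width the estimate tracks the smoothed error entropy rather than the true error entropy; that gap is exactly what the SMEE analysis addresses.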

Highlights

  • The principles and methods in Shannon’s information theory have been widely applied in statistical estimation, filtering, and learning problems [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]

  • Remark 8: when the smoothing factor λ is very large, minimizing the smoothed error entropy becomes approximately equivalent to minimizing the error variance (see the numerical sketch after this list)

  • The optimality criteria based on second-order statistics are computationally simple and are optimal under linear and Gaussian assumptions
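
To make the Remark 8 highlight concrete, here is a small numerical sketch (ours, with arbitrarily chosen error densities and a Gaussian smoothing kernel). Error A is Gaussian with variance 1; error B is Laplacian with the larger variance 1.1 but the smaller entropy. Without smoothing, the MEE ordering therefore disagrees with the variance ordering, but as the smoothing factor grows the smoothed entropies order by variance, as the highlight states.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smoothed_entropy(pdf, dx, lam):
    """Entropy of e + lam*eps (eps ~ standard normal kernel), computed by
    convolving the tabulated error density with a Gaussian of std lam and
    integrating -p*log(p) on the grid."""
    if lam > 0:
        pdf = gaussian_filter1d(pdf, sigma=lam / dx, mode="constant")
    p = np.clip(pdf, 1e-300, None)
    return -np.sum(p * np.log(p)) * dx

grid = np.linspace(-80, 80, 16001)
dx = grid[1] - grid[0]
gauss = np.exp(-0.5 * grid ** 2) / np.sqrt(2 * np.pi)  # error A: variance 1.0, entropy ~1.419
b = np.sqrt(0.55)                                       # Laplace scale -> variance 2*b^2 = 1.1
laplace = np.exp(-np.abs(grid) / b) / (2 * b)           # error B: variance 1.1, entropy ~1.394

for lam in (0.0, 2.0, 5.0):
    print(f"lambda={lam:>4}:  H_A={smoothed_entropy(gauss, dx, lam):.4f}"
          f"  H_B={smoothed_entropy(laplace, dx, lam):.4f}")
# lambda = 0: H_B < H_A even though Var(B) > Var(A), so MEE prefers the wider error B.
# As lambda grows the ordering flips and follows the variances, H_A < H_B.
```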


Summary

Introduction

The principles and methods in Shannon’s information theory have been widely applied in statistical estimation, filtering, and learning problems [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]. Familiar examples include the support vector machine (SVM) [19,20] and kernel adaptive filtering [21]. In these kernel methods, the hypothesis space for learning is in general a high (possibly infinite) dimensional reproducing kernel Hilbert space (RKHS) ℋ, and the optimal mapping under the MEE criterion is f* = arg min_{f ∈ ℋ} H(e), where e = Y − f(X) is the prediction error and H(·) denotes its differential entropy. By the Parzen windowing approach (with a fixed kernel function κ_λ), the estimated error entropy converges almost surely (a.s.) to the entropy of the convolved density, i.e., the smoothed error entropy (see [22,23]). Although SMEE is thus the actual learning criterion in ITL (as the sample number N → ∞), its theoretical properties have up to now been little studied.
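
As a rough illustration of learning under this criterion (a hypothetical sketch, not code from the paper; the data, the fixed-centre kernel expansion standing in for the RKHS model, and all parameter values are our own), the script below fits a small model by minimizing the Parzen-window estimate of the error entropy and compares the resulting error entropy with that of an ordinary least-squares fit under heavy-tailed noise.

```python
import numpy as np
from scipy.optimize import minimize

def parzen_entropy(e, lam):
    """Parzen-window (Gaussian kernel) plug-in estimate of the error entropy."""
    d = e[:, None] - e[None, :]
    k = np.exp(-0.5 * (d / lam) ** 2) / (lam * np.sqrt(2 * np.pi))
    return -np.log(k.mean(axis=1)).mean()

# Toy regression data with heavy-tailed (non-Gaussian) noise.
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=200)
y = np.sin(2 * x) + 0.1 * rng.standard_t(df=2, size=200)

# A small fixed Gaussian-kernel expansion as a crude stand-in for the RKHS model.
centres = np.linspace(-2, 2, 15)
Phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * 0.5 ** 2))

smee_cost = lambda w, lam=0.2: parzen_entropy(y - Phi @ w, lam)
w_mee = minimize(smee_cost, np.zeros(centres.size), method="L-BFGS-B").x
w_mse = np.linalg.lstsq(Phi, y, rcond=None)[0]   # mean-square-error baseline

# Entropy ignores the error mean, so in practice an output bias is adjusted afterwards.
print("error entropy, SMEE fit:", parzen_entropy(y - Phi @ w_mee, 0.2))
print("error entropy, MSE  fit:", parzen_entropy(y - Phi @ w_mse, 0.2))
```

Since the first fit directly minimizes the Parzen entropy estimate, its reported error entropy should come out no larger than that of the least-squares fit; the gap is typically visible with non-Gaussian noise, which is the regime where MEE-type criteria are argued to help.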

Some Basic Properties of SMEE Criterion
How Smoothing Factor Affects the Optimal Solution
Conclusions