Abstract

The present aim is to update, upon arrival of new learning data, the parameters of a score constructed with an ensemble method involving linear discriminant analysis and logistic regression in an online setting, without the need to store all of the previously obtained data. Poisson bootstrap and stochastic approximation processes were used with online standardized data to avoid numerical explosions; the convergence of these processes has been established theoretically. The empirical convergence of online ensemble scores to a reference “batch” score was studied on five different datasets from which data streams were simulated, comparing six different processes for constructing the online scores. For each score, 50 replications using a total of 10N observations (N being the size of the dataset) were performed to assess the convergence and stability of the method, computing the mean and standard deviation of a convergence criterion. A complementary study using 100N observations was also performed. All tested processes converged on all datasets after N iterations, except for one process on one dataset. The best processes were averaged processes using online standardized data and a piecewise constant step size.
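
The core update described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the Welford-style running moments, and the Poisson(1) bootstrap weight are assumptions about how one stochastic approximation step on online standardized data might look for a logistic-regression component of the score.

```python
import numpy as np

rng = np.random.default_rng(0)

def online_logistic_update(theta, x, y, mean, var, n, step):
    """One hypothetical stochastic-approximation step on online
    standardized data. A Poisson(1) weight mimics including the new
    observation a random number of times (Poisson bootstrap)."""
    n += 1
    # Welford-style running mean and variance (online standardization,
    # so no past observations need to be stored)
    delta = x - mean
    mean = mean + delta / n
    var = var + (delta * (x - mean) - var) / n
    z = (x - mean) / np.sqrt(var + 1e-8)    # standardized observation
    w = rng.poisson(1)                      # Poisson bootstrap weight
    p = 1.0 / (1.0 + np.exp(-theta @ z))    # logistic prediction
    theta = theta + step * w * (y - p) * z  # gradient step on the
    return theta, mean, var, n              # log-likelihood
```

Standardizing with running moments keeps the gradients on a bounded scale, which is what avoids the numerical explosions mentioned in the abstract.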

Highlights

  • When considering the problem of predicting the values of a dependent variable y, whether continuous or categorical, from observed variables x1, …, xp, which are themselves continuous or categorical, many different predictors can be constructed to address this problem.

  • Datasets: Four datasets available on the Internet and one dataset derived from the EPHESUS study [17] were used, all of which have previously been utilized to test the performance of stochastic approximation processes with online standardized data in the case of online linear regression [10] and online logistic regression [11].

  • Convergence and Stability Analysis for 10N Observations: When replicating each pair of processes 50 times, the mean criterion values were lower than 0.05 for all pairs of processes applied to the Twonorm, Ringnorm and Quantum datasets (Table 3).

Introduction

Various types of regressions or classification rules can be used, as well as different samples (e.g. bootstrap samples), different variable selection methods (random, stepwise selection, shrinkage methods, etc.), or, more generally, the introduction of a random element into the construction of the predictors. Bagging [2], boosting [3], random forests [1] and Random Generalized Linear Models (RGLM) [4] are examples of ensemble methods. Another method for constructing an ensemble score in seven steps was recently proposed in Duarte et al. [5], was used in Lalloué et al. [6], and will serve as the reference in this article: 1) selection of n1 classification rules. The theoretical guarantees of convergence already demonstrated for this type of process provide little information on the practical choices to be made.
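
One of those practical choices, highlighted in the abstract, is the combination of iterate averaging with a piecewise constant step size. The sketch below shows the general idea (Polyak–Ruppert averaging with the step halved after each block of iterations) on a simple logistic model; the function name, block sizes and halving schedule are illustrative assumptions, not the processes compared in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def averaged_sgd(stream, dim, blocks=5, block_len=200, step0=1.0):
    """Hypothetical averaged stochastic gradient process with a
    piecewise constant step size: the step stays constant within a
    block of iterations and is halved between blocks, while the
    running average of the iterates is maintained online."""
    theta = np.zeros(dim)      # current iterate
    theta_bar = np.zeros(dim)  # running average of iterates
    t = 0
    for b in range(blocks):
        step = step0 / (2 ** b)  # constant within each block
        for _ in range(block_len):
            x, y = next(stream)
            p = 1.0 / (1.0 + np.exp(-theta @ x))
            theta = theta + step * (y - p) * x   # logistic SGD step
            t += 1
            theta_bar += (theta - theta_bar) / t  # online average
    return theta_bar
```

Averaging smooths out the oscillations that a non-vanishing step size induces, which is consistent with the averaged processes being the best performers in the study.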
