Abstract

Conservation machine learning conserves models across runs, users, and experiments—and puts them to good use. We have previously shown the merit of this idea through a small-scale preliminary experiment, involving a single dataset source, 10 datasets, and a single so-called cultivation method—used to produce the final ensemble. In this paper, focusing on classification tasks, we perform extensive experimentation with conservation random forests, involving 5 cultivation methods (including a novel one introduced herein—lexigarden), 6 dataset sources, and 31 datasets. We show that significant improvement can be attained by making use of models we are already in possession of anyway, and envisage the possibility of repositories of models (not merely datasets, solutions, or code), which could be made available to everyone, thus having conservation live up to its name, furthering the cause of data and computational science.

Highlights

  • Conservation machine learning conserves models across runs, users, and experiments—and puts them to good use

  • We recently presented the idea of conservation machine learning, wherein machine learning (ML) models are saved across multiple runs, users, and experiments[1]

  • We believe the novelty of conservation machine learning, applied to random forests, is two-fold

Introduction

Conservation machine learning conserves models across runs, users, and experiments—and puts them to good use. We show that significant improvement can be attained by making use of models we are already in possession of anyway, and envisage the possibility of repositories of models (not merely datasets, solutions, or code), which could be made available to everyone, having conservation live up to its name and furthering the cause of data and computational science. A random forest (RF) is an oft-used ensemble technique that employs a forest of decision-tree classifiers trained on various sub-samples of the dataset, with random subsets of the features considered at each node split. It uses majority voting (for classification problems) or averaging (for regression problems) to improve predictive accuracy and control over-fitting[2]. Reference[3] presented a method for constructing ensembles from libraries of thousands of models. They used a simple hill-climbing procedure to build the final ensemble, and successfully tested their method on 7 problems. Our second contribution in this paper is the introduction of a new ensemble cultivation method—lexigarden.
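As a minimal sketch of the majority-voting step described above (toy threshold rules stand in for decision trees conserved from earlier runs; all names here are illustrative, not the paper's implementation):

```python
from collections import Counter

def majority_vote(models, x):
    """Classify x by majority vote over a pool of conserved models."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

# Three toy classifiers, as if collected across separate runs/experiments;
# real conserved models would be trained decision trees.
models = [
    lambda x: 1 if x > 0.3 else 0,
    lambda x: 1 if x > 0.5 else 0,
    lambda x: 1 if x > 0.7 else 0,
]

print(majority_vote(models, 0.6))  # two of three models vote 1
```

The same voting function applies unchanged whether the pool holds the trees of a single random forest or trees conserved across many runs and users, which is what makes the conservation idea cheap to apply.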
