Exploiting partially-labeled data in learning predictive clustering trees for multi-target regression: A case study of water quality assessment in Ireland

Stevanche Nikoloski, Dragi Kocev, Jurica Levatić, David P Wall, Sašo Džeroski

doi:10.1016/j.ecoinf.2020.101161

Abstract

Many environmental problems give rise to predictive modeling tasks where several dependent variables need to be predicted simultaneously from a given set of independent variables. When the target variables are numeric, the task at hand is called multi-target regression (MTR). An example task of this type is the assessment of quality of agricultural waters in Ireland according to three indicators: biological water quality, nitrogen concentration and phosphorus concentration.

Multi-target regression models are typically learnt from labeled training examples, where the values of both the dependent variables (labels) and the independent variables are provided, in a setting known as supervised learning. Many different approaches to supervised multi-target regression have been developed, among which predictive clustering trees and ensembles thereof stand out due to their effectiveness and efficiency. Recently, these approaches have been extended to exploit not only labeled examples, but also unlabeled examples, where only the values of the independent variables are provided, a setting known as semi-supervised learning.

In practice, training data can also contain partially labeled examples, where the values of some of the dependent variables are provided and others are missing (in addition to fully labeled examples where all target values are provided and completely unlabeled examples where no target values are provided). For the task of water quality assessment in Ireland, we encounter this kind of partially labeled data. Existing supervised and semi-supervised MTR approaches typically ignore partially labeled data.

In this paper, we propose the use of semi-supervised predictive clustering trees for MTR that can handle partially labeled examples. We apply these to the task of assessment of water quality in Ireland, showing that better performance can be achieved if partially labeled examples are exploited, rather than discarded.
We build both local models (collections of single-target models predicting each target separately) and global models (multi-target models simultaneously predicting all targets), showing that global models are smaller, easier to interpret, and less prone to overfitting (with better predictive performance) than local models.
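
To make the distinction between fully labeled, partially labeled and unlabeled examples concrete, the following is a minimal illustrative sketch. It is not the implementation used in the paper. The function names `label_status` and `per_target_variance`, the use of `None` for a missing target value, and the numeric values are all assumptions made for illustration. The sketch shows one simple way a partially labeled example can still contribute: each target's variance is computed over the examples where that target is observed, instead of discarding the whole example.

```python
from statistics import pvariance


def label_status(targets):
    """Return 'full', 'partial' or 'unlabeled' for one example's target vector.

    A missing target value is represented by None (an assumption of this sketch).
    """
    n_observed = sum(value is not None for value in targets)
    if n_observed == len(targets):
        return "full"
    if n_observed == 0:
        return "unlabeled"
    return "partial"


def per_target_variance(rows):
    """Population variance of each target, using only its observed values.

    Returns None for a target with no observed values at all.
    """
    n_targets = len(rows[0])
    variances = []
    for j in range(n_targets):
        observed = [row[j] for row in rows if row[j] is not None]
        variances.append(pvariance(observed) if observed else None)
    return variances


# Hypothetical target vectors: (biological quality, nitrogen, phosphorus).
# The numbers are made up for illustration and are not data from the study.
examples = [
    (4.0, 2.0, 0.10),    # all three targets observed
    (3.0, None, 0.20),   # nitrogen missing
    (None, None, None),  # no targets observed
    (5.0, 4.0, None),    # phosphorus missing
]

statuses = [label_status(t) for t in examples]
variances = per_target_variance(examples)
print(statuses)
print(variances)
```

In this toy data, discarding the two partially labeled examples would leave a single labeled example, whereas using the observed values retains three observations for the first target and two for each of the others.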

Full Text