Improving high-dimensional data fusion by exploiting the multivariate advantage

B.P. Geurts, J. Engel, B. Rafii, L. Blanchet, A. Suppers, E. Szymańska, J.J. Jansen, L.M.C. Buydens

doi:10.1016/j.chemolab.2016.05.010

Abstract

As no analytical chemical platform exists that can characterize the full chemical composition of a sample, multiple platforms are often used to measure the same sample. The chemometric analysis of the resulting data then requires the data to be ‘fused’. The more comprehensive view of each sample should enhance understanding of the underlying chemistry and/or increase the predictive accuracy of the resulting model. Different data fusion approaches have been proposed for this purpose; each has its own drawbacks and advantages. In this paper we propose a new strategy for data fusion that combines the advantages of low-level fusion with those of mid- and high-level data fusion. We argue that the information that is usually discarded in the latter fusion approaches can still benefit both classification and regression when multiple data blocks are considered together. This information may be recovered by a regression employing the intraclass correlation between the discarded and retained data. A comprehensive simulation study shows that, for classification, the resulting data fusion method outperforms the conventional data fusion approaches in many scenarios of communal information between data blocks. A real-life example on predicting the bitterness of different beers shows that the method also has great potential for regression.

Full Text