With the goal of identifying biomarkers/patterns related to certain conditions or diseases, metabolomics focuses on the detection of chemical substances in biological samples such as urine and blood using a number of analytical techniques, including nuclear magnetic resonance (NMR) spectroscopy, liquid chromatography-mass spectrometry (LC–MS), and fluorescence spectroscopy. Data sets measured using these methods provide partly complementary information, and their joint analysis has the potential to reveal underlying structures that are otherwise difficult to extract. While vast amounts of data can be collected using different analytical methods, data fusion remains a challenging task, particularly when the goal is to capture the underlying factors and use them for interpretation, e.g., for biomarker identification. Furthermore, many data fusion applications require joint analysis of heterogeneous data sets (i.e., data sets in the form of matrices and higher-order tensors) with shared/unshared factors. In order to jointly analyze such heterogeneous data sets, we formulate data fusion as a coupled matrix and tensor factorization (CMTF) problem, which has already proved useful in many data mining applications, and discuss its extension to a structure-revealing data fusion model, i.e., a data fusion model that can identify shared and unshared factors. The methods traditionally used for data fusion in the presence of shared/unshared factors are based on matrix factorization. Using both simulations and prototypical experimental coupled data sets, we assess the performance of various state-of-the-art data fusion methods and demonstrate that, while matrix factorization-based approaches have limitations when used for joint analysis of heterogeneous data sets, the structure-revealing CMTF model can successfully capture the underlying factors by exploiting the low-rank structure of higher-order data sets.
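
For orientation, the CMTF formulation mentioned above is commonly written as a joint least-squares fit of a third-order tensor and a matrix that share one factor. The sketch below is illustrative only: the symbols (a tensor \mathcal{X} coupled with a matrix Y in the first mode, factor matrices A, B, C, V, weights \lambda and \sigma, and sparsity penalty \beta) are assumptions introduced here and are not defined in the abstract. The second expression shows one way a structure-revealing variant is often posed, with sparsity-penalized component weights indicating which factors are shared and which are not.

```latex
% Assumed notation: \mathcal{X} \in \mathbb{R}^{I \times J \times K} and
% Y \in \mathbb{R}^{I \times M} are coupled in the first mode and share A.
\begin{align}
  f(A, B, C, V)
    &= \bigl\| \mathcal{X} - [\![ A, B, C ]\!] \bigr\|^2
     + \bigl\| Y - A V^{\top} \bigr\|^2 \\[4pt]
% Structure-revealing variant (a sketch): columns are normalized and the
% weights \lambda, \sigma are penalized so that unshared components are
% driven toward zero weight in one of the data sets.
  \min_{\lambda, \sigma, A, B, C, V} \;
    & \bigl\| \mathcal{X} - [\![ \lambda; A, B, C ]\!] \bigr\|^2
     + \bigl\| Y - A \,\mathrm{diag}(\sigma)\, V^{\top} \bigr\|^2
     + \beta \|\lambda\|_1 + \beta \|\sigma\|_1 \\
  \text{s.t.} \;
    & \|a_r\| = \|b_r\| = \|c_r\| = \|v_r\| = 1, \quad r = 1, \dots, R
\end{align}
```

Under this reading, a component whose weight is nonzero in both \lambda and \sigma is shared between the two data sets, while a component with a weight near zero in one of them is unshared.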