A hierarchical Bayesian model to find brain-behaviour associations in incomplete data sets

Fabio S Ferreira, Agoston Mihalik, Rick A Adams, John Ashburner, Janaina Mourao-Miranda

doi:10.1016/j.neuroimage.2021.118854


Open Access

https://doi.org/10.1016/j.neuroimage.2021.118854


Abstract

Canonical Correlation Analysis (CCA) and its regularised versions have been widely used in the neuroimaging community to uncover multivariate associations between two data modalities (e.g., brain imaging and behaviour). However, these methods have inherent limitations: (1) statistical inferences about the associations are often not robust; (2) the associations within each data modality are not modelled; (3) missing values need to be imputed or removed. Group Factor Analysis (GFA) is a hierarchical model that addresses the first two limitations by providing Bayesian inference and modelling modality-specific associations. Here, we propose an extension of GFA that handles missing data, and highlight that GFA can be used as a predictive model. We applied GFA to synthetic and real data consisting of brain connectivity and non-imaging measures from the Human Connectome Project (HCP). In synthetic data, GFA uncovered the underlying shared and specific factors and correctly predicted the non-observed data modalities in complete and incomplete data sets. In the HCP data, we identified four relevant shared factors, capturing associations between mood, alcohol and drug use, cognition, demographics and psychopathological measures and the default mode, frontoparietal control, dorsal and ventral networks and insula, as well as two factors describing associations within brain connectivity. In addition, GFA predicted a set of non-imaging measures from brain connectivity. These findings were consistent in complete and incomplete data sets, and replicated previous findings in the literature. GFA is a promising tool that can be used to uncover associations between and within multiple data modalities in benchmark datasets (such as HCP), and can be easily extended to more complex models to solve more challenging tasks.

Highlights

In the past few years, there has been a substantial increase in the application of multivariate methods, such as Canonical Correlation Analysis (CCA) (Hotelling, 1936), to identify associations between two data modalities
Canonical Correlation Analysis (CCA) was introduced by Hotelling (1936) and is a classical method for seeking maximal correlations between linear combinations of two multivariate data sets, which can be seen as two different data modalities from the same observations or individuals
In the multi-output prediction task, we showed that the model could make reasonable predictions when the data were missing randomly or one modality was missing for some observations, i.e., the mean squared errors (MSEs) were similar across experiments and below chance level (Table 1)
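
To make the comparison of MSE against chance level concrete, the following is a minimal sketch on synthetic data. It is not the authors' implementation: ordinary least squares stands in for GFA as the predictor, and chance level is assumed here to be the MSE obtained by predicting each output with its training-set mean. This summary does not state that definition, so treat it as an assumption. All variable names and data sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test = 200, 100

# Synthetic stand-ins: 6 "brain" features predicting 3 "non-imaging" outputs.
W = rng.standard_normal((6, 3))
X_train = rng.standard_normal((n_train, 6))
X_test = rng.standard_normal((n_test, 6))
Y_train = X_train @ W + rng.standard_normal((n_train, 3))
Y_test = X_test @ W + rng.standard_normal((n_test, 3))


def mse(y_true, y_pred):
    """Mean squared error averaged over observations and outputs."""
    return float(np.mean((y_true - y_pred) ** 2))


# Stand-in predictor (NOT GFA): least squares with an intercept column.
design_train = np.column_stack([np.ones(n_train), X_train])
coef, *_ = np.linalg.lstsq(design_train, Y_train, rcond=None)
Y_pred = np.column_stack([np.ones(n_test), X_test]) @ coef

model_mse = mse(Y_test, Y_pred)

# Assumed chance level: always predict the training mean of each output.
chance_mse = mse(Y_test, Y_train.mean(axis=0))

print(f"model MSE: {model_mse:.3f}, chance MSE: {chance_mse:.3f}")
```

A model is informative in this sense only when its MSE falls below the chance MSE computed on the same test set.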

Summary

Introduction

In the past few years, there has been a substantial increase in the application of multivariate methods, such as Canonical Correlation Analysis (CCA) (Hotelling, 1936), to identify associations between two data modalities (e.g., brain imaging and behaviour). CCA and regularised variants of CCA, such as sparse CCA (Waaijenborg et al, 2008; Le Cao et al, 2009; Witten et al, 2009), have been used to identify associations, for instance, between brain connectivity and cognitive/psychopathology measures (Drysdale et al, 2017; Xia et al, 2018; Mihalik et al, 2019), brain connectivity and general lifestyle, demographic and behavioural measures (Smith et al, 2015; Bijsterbosch et al, 2018; Lee et al, 2019; Li et al, 2019; Alnæs et al, 2020), brain structure, demographic and behavioural measures (Monteiro et al, 2016; Mihalik et al, 2020) and between different brain imaging modalities (Sui et al, 2012). Missing data are a common issue in clinical and neuroimaging datasets, in which the missing values usually need to be imputed or removed before applying the models.
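
As an illustration of classical CCA and of the missing-data limitation described above, here is a minimal sketch on synthetic data. It is not taken from the paper: the function name `cca`, the small ridge term `reg` (added only for numerical stability) and the simulated data are all assumptions. Rows containing any missing value are removed before fitting, which is the removal strategy that standard CCA requires.

```python
import numpy as np


def cca(X, Y, reg=1e-8):
    """Classical CCA via Cholesky whitening and SVD.

    Returns the canonical correlations (descending) and the weight
    matrices A (for X) and B (for Y).
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)
    Lx = np.linalg.cholesky(Sxx)  # Sxx = Lx @ Lx.T
    Ly = np.linalg.cholesky(Syy)  # Syy = Ly @ Ly.T
    # Whitened cross-covariance K = Lx^{-1} Sxy Ly^{-T}
    K = np.linalg.solve(Lx, Sxy)
    K = np.linalg.solve(Ly, K.T).T
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    A = np.linalg.solve(Lx.T, U)
    B = np.linalg.solve(Ly.T, Vt.T)
    return s, A, B


rng = np.random.default_rng(0)
n = 500
z = rng.standard_normal((n, 1))  # one shared latent factor
wx = np.array([[1.0, -0.8, 0.6, 0.0, 0.4]])
wy = np.array([[0.9, 0.5, -0.7, 0.3]])
X = z @ wx + 0.5 * rng.standard_normal((n, 5))
Y = z @ wy + 0.5 * rng.standard_normal((n, 4))

# Remove roughly 10% of the entries at random in each modality.
X_miss = X.copy()
Y_miss = Y.copy()
X_miss[rng.random(X.shape) < 0.1] = np.nan
Y_miss[rng.random(Y.shape) < 0.1] = np.nan

# Keep only observations that are complete in both modalities.
complete = ~np.isnan(X_miss).any(axis=1) & ~np.isnan(Y_miss).any(axis=1)
X_obs, Y_obs = X_miss[complete], Y_miss[complete]

s, A, B = cca(X_obs, Y_obs)
# Correlation between the first pair of canonical variates.
r = np.corrcoef(X_obs @ A[:, 0], Y_obs @ B[:, 0])[0, 1]
print(f"kept {complete.sum()} of {n} rows; first canonical correlation: {s[0]:.3f}")
```

Even with only about 10% of entries missing, removing incomplete rows discards a large share of the observations in this toy example, which is the cost that a model handling missing data directly is meant to avoid.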

Methods

Results

Discussion

Conclusion