Harmonizing and aligning M/EEG datasets with covariance-based techniques to enhance predictive regression modeling

Apolline Mellot,Alexandre Gramfort,Pedro L. C. Rodrigues,Denis Engemann,Antoine Collas

doi:10.1162/imag_a_00040

Abstract

Abstract Neuroscience studies face challenges in gathering large datasets, which limits the use of machine learning (ML) approaches. One possible solution is to incorporate additional data from large public datasets; however, data collected in different contexts often exhibit systematic differences called dataset shifts. Various factors, for example, site, device type, experimental protocol, or social characteristics, can lead to substantial divergence of brain signals that can hinder the success of ML across datasets. In this work, we focus on dataset shifts in recordings of brain activity using MEG and EEG. State-of-the-art predictive approaches on magneto- and electroencephalography (M/EEG) signals classically represent the data by covariance matrices. Model-based dataset alignment methods can leverage the geometry of covariance matrices, leading to three steps: re-centering, re-scaling, and rotation correction. This work explains theoretically how differences in brain activity, anatomy, or device configuration lead to certain shifts in data covariances. Using controlled simulations, the different alignment methods are evaluated. Their practical relevance is evaluated for brain age prediction on one MEG dataset (Cam-CAN, n = 646) and two EEG datasets (TUAB, n = 1385; LEMON, n = 213). Among the same dataset (Cam-CAN), when training and test recordings were from the same subjects but performing different tasks, paired rotation correction was essential (δR2=+0.13 (rest-passive) or +0.17 (rest-smt)). When in addition to different tasks we included unseen subjects, re-centering led to improved performance (δR2=+0.096 for rest-passive, δR2=+0.045 for rest-smt). For generalization to an independent dataset sampled from a different population and recorded with a different device, re-centering was necessary to achieve brain age prediction performance close to within dataset prediction performance. This study demonstrates that the generalization of M/EEG-based regression models across datasets can be substantially enhanced by applying domain adaptation procedures that can statistically harmonize diverse datasets.

Full Text