Improving the accuracy of vigilance estimation has become an active research topic. Although the growing number of available modalities creates new opportunities to improve performance, uncertain cross-modal interactions still pose a challenge for multimodal fusion. In this paper, we propose a cross-modality alignment method based on contrastive learning that extracts information which is shared across modalities but not identical between them. Contrastive learning is used to minimize intermodal differences by maximizing the similarity between the semantic representations of the modalities. We evaluated the proposed framework on the SEED-VIG dataset, which consists of EEG and EOG signals. Experiments showed that our approach achieved state-of-the-art multimodal vigilance estimation performance in both intra-subject and inter-subject settings, with average RMSE/CORR improved to 0.092/0.893 and 0.144/0.887, respectively. In addition, an analysis of frequency bands showed that theta and alpha activities contain valuable information for vigilance estimation, and that their correlation with PERCLOS can be significantly improved by contrastive learning. We argue that, in the inter-subject case, the proposed method could reduce the high cost of data annotation, and that further analysis may inform applications of multimodal vigilance regression.
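To make the alignment idea concrete, the following is a minimal sketch of a contrastive objective between two modalities. The abstract does not specify the loss, so the symmetric InfoNCE-style form, the cosine similarity, the `temperature` value, and all function names here are assumptions chosen for illustration, not the authors' implementation. Embeddings from the same time window (same index) are treated as positive pairs, and all other pairings in the batch as negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, candidates, temperature=0.1):
    """Mean cross-entropy of selecting, for each anchor, the candidate
    with the same index among all candidates in the batch."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum - logits[i]
    return total / len(anchors)


def alignment_loss(eeg, eog, temperature=0.1):
    """Symmetric loss (assumed form): EEG-to-EOG plus EOG-to-EEG, averaged."""
    return 0.5 * (info_nce(eeg, eog, temperature)
                  + info_nce(eog, eeg, temperature))


# Toy embeddings (hypothetical values): two time windows, two dimensions.
eeg = [[1.0, 0.0], [0.0, 1.0]]
eog_aligned = [[0.9, 0.1], [0.1, 0.9]]    # matches EEG window by window
eog_shuffled = [[0.1, 0.9], [0.9, 0.1]]   # windows swapped

loss_aligned = alignment_loss(eeg, eog_aligned)
loss_shuffled = alignment_loss(eeg, eog_shuffled)
```

In this toy case the loss is lower for `eog_aligned` than for `eog_shuffled`, so minimizing it pulls the representations of matching EEG and EOG windows together while pushing mismatched windows apart. A real model would compute the embeddings with learned encoders and combine this term with a regression loss on the vigilance label.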