Pilot stress detection is a challenging task that plays a vital role in improving flight performance and preventing catastrophic accidents. Many deep learning models have been adopted for stress recognition; however, these models tend to ignore the dependencies between multi-modal physiological signals, which could potentially boost model performance. A transformer-based deep learning framework is proposed for detecting pilot stress; it captures the positional information of multi-modal signals by combining a transformer network with a traditional convolutional neural network (CNN). Physiological data from 14 pilots under different stress states, including electrocardiography (ECG), electromyography (EMG), electrodermal activity (EDA), respiration (RESP), and skin temperature (SKT), are collected for training and validation and evaluated against different state-of-the-art models. The results show that the proposed model achieves accuracies of 93.28%, 88.75%, and 84.85% on the 2-class, 3-class, and 4-class classification tasks respectively, demonstrating faster convergence and promising performance.
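To make the described architecture concrete, the following is a minimal NumPy sketch of the general pattern the abstract names: a convolutional front end extracts features from each physiological channel, the resulting tokens receive positional encodings, and a self-attention layer models cross-modal dependencies before a pooled classification head. All shapes, the shared convolution weights, the sinusoidal positional encoding, and the untrained random parameters are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w, stride=2):
    """Valid 1-D convolution with ReLU: x (T, C_in), w (K, C_in, C_out) -> (T', C_out)."""
    K, _, C_out = w.shape
    T_out = (x.shape[0] - K) // stride + 1
    out = np.empty((T_out, C_out))
    for t in range(T_out):
        out[t] = np.einsum('kc,kco->o', x[t * stride:t * stride + K], w)
    return np.maximum(out, 0.0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over the token sequence."""
    Q, K_, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K_.T / np.sqrt(Q.shape[-1]), axis=-1)
    return A @ V

# Hypothetical setup: 5 modalities, 64 samples per window, embedding dim 16.
T, D = 64, 16
modalities = {m: rng.standard_normal((T, 1))
              for m in ['ECG', 'EMG', 'EDA', 'RESP', 'SKT']}

w_conv = rng.standard_normal((8, 1, D)) * 0.1        # shared CNN front end (assumption)
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
W_cls = rng.standard_normal((D, 4)) * 0.1            # 4-class classification head

# CNN features per modality, stacked into one token sequence across modalities.
tokens = np.concatenate([conv1d(x, w_conv) for x in modalities.values()])
# Sinusoidal positional encoding so attention can use token order.
pos = np.sin(np.arange(tokens.shape[0])[:, None]
             / 10000 ** (np.arange(D)[None, :] / D))
h = self_attention(tokens + pos, Wq, Wk, Wv)

# Mean-pool the attended tokens and classify into the 4 stress states.
probs = softmax(h.mean(axis=0) @ W_cls)
print(probs.shape)  # (4,) — one probability per stress class
```

With random, untrained weights the output is a uniform-ish distribution over the four classes; the point is only the data flow (per-modality CNN, positional encoding, attention across modalities, pooled classifier), which a real implementation would train end-to-end.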