Sleep stage classification is essential for clinical disease diagnosis and sleep quality assessment. Most existing methods for sleep stage classification rely on single-channel or single-modal signals and extract features with a single-branch deep convolutional network. This design hinders the capture of the diverse features related to sleep, increases the computational cost, and limits classification accuracy. To address these problems, this paper proposes an end-to-end multi-modal physiological time-frequency feature extraction network (MTFF-Net) for accurate sleep stage classification. First, multi-modal physiological signals, comprising the electroencephalogram (EEG), electrocardiogram (ECG), electrooculogram (EOG) and electromyogram (EMG), are converted into two-dimensional time-frequency images by the short-time Fourier transform (STFT). Then, a time-frequency feature extraction network that combines a multi-scale EEG compact convolution network (Ms-EEGNet) with a bidirectional gated recurrent unit (Bi-GRU) network extracts multi-scale spectral features related to characteristic sleep waveforms and time-series features related to sleep stage transitions. Under the American Academy of Sleep Medicine (AASM) sleep stage classification criteria, the model achieves an accuracy of 84.3%, a macro F1 score of 83.1% and a Cohen's kappa coefficient of 79.8% on the five-class task on the third subgroup of the Institute of Systems and Robotics of the University of Coimbra Sleep Dataset (ISRUC-S3). The experimental results show that the proposed model improves classification accuracy and promotes the application of deep learning algorithms in assisting clinical decision-making.
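As an illustration of the first step, the conversion of each physiological channel into a two-dimensional time-frequency image by STFT can be sketched as follows. This is a minimal sketch, not the preprocessing used in the paper: the window length, hop size, 128 Hz sampling rate, synthetic sine-wave signals, and the function names `stft_magnitude` and `to_time_frequency_images` are all assumptions made for illustration. It is written in pure Python so that it is self-contained; in practice an optimized FFT routine would normally be used.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft_magnitude(signal, win_len=64, hop=32):
    """Magnitude STFT of a real 1-D signal.

    Returns image[freq_bin][frame], a 2-D time-frequency image.
    win_len and hop are illustrative defaults (assumptions).
    """
    window = hann(win_len)
    n_bins = win_len // 2 + 1  # one-sided spectrum of a real signal
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        # Window one segment, then take its discrete Fourier transform.
        seg = [signal[start + i] * window[i] for i in range(win_len)]
        spectrum = []
        for k in range(n_bins):
            acc = sum(
                seg[n] * cmath.exp(-2j * math.pi * k * n / win_len)
                for n in range(win_len)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    # Transpose so that rows are frequency bins and columns are time frames.
    return [list(row) for row in zip(*frames)]


def to_time_frequency_images(channels, win_len=64, hop=32):
    """Apply the STFT to every channel of a multi-modal recording.

    channels maps a channel name (e.g. "EEG") to a list of samples.
    """
    return {
        name: stft_magnitude(samples, win_len, hop)
        for name, samples in channels.items()
    }


# Synthetic two-second recording; the rate and frequencies are assumptions.
FS = 128
N = 2 * FS
recording = {
    "EEG": [math.sin(2 * math.pi * 10 * t / FS) for t in range(N)],
    "ECG": [math.sin(2 * math.pi * 2 * t / FS) for t in range(N)],
    "EOG": [math.sin(2 * math.pi * 4 * t / FS) for t in range(N)],
    "EMG": [math.sin(2 * math.pi * 40 * t / FS) for t in range(N)],
}
images = to_time_frequency_images(recording)
```

Each entry of `images` is a frequency-by-time array that could be stacked across modalities and passed to a convolutional feature extractor. With a 64-sample window at 128 Hz, adjacent frequency bins are 2 Hz apart, so the window length trades frequency resolution against time resolution.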