The Rapid Serial Visual Presentation (RSVP)-based Brain–Computer Interface (BCI) is an efficient technology for target retrieval using electroencephalography (EEG) signals. The performance improvement of traditional decoding methods relies on a substantial amount of training data from new test subjects, which increases preparation time for BCI systems. Several studies introduce data from existing subjects to reduce the dependence of performance improvement on data from new subjects, but their optimization strategy based on adversarial learning with extensive data increases training time during the preparation procedure. Moreover, most previous methods only focus on the single-view information of EEG signals, but ignore the information from other views which may further improve performance. To enhance decoding performance while reducing preparation time, we propose a Temporal-Spectral fusion transformer with Subject-specific Adapter (TSformer-SA). Specifically, a cross-view interaction module is proposed to facilitate information transfer and extract common representations across two-view features extracted from EEG temporal signals and spectrogram images. Then, an attention-based fusion module fuses the features of two views to obtain comprehensive discriminative features for classification. Furthermore, a multi-view consistency loss is proposed to maximize the feature similarity between two views of the same EEG signal. Finally, we propose a subject-specific adapter to rapidly transfer the knowledge of the model trained on data from existing subjects to decode data from new subjects. Experimental results show that TSformer-SA significantly outperforms comparison methods and achieves outstanding performance with limited training data from new subjects. This facilitates efficient decoding and rapid deployment of BCI systems in practical use.