RGB-Thermal Video Object Detection (RGBT VOD) aims to localize and classify predefined objects in visible and thermal spectrum videos. The key issue in RGBT VOD lies in integrating multi-modal information effectively to improve detection performance. Current multi-modal fusion methods predominantly employ middle fusion strategies, but the inherent difference between modalities directly limits the effectiveness of such fusion. Although the early fusion strategy reduces the modality gap in the middle stage of the network, achieving in-depth feature interaction between different modalities remains challenging. In this work, we propose a novel hybrid fusion network called PTMNet, which effectively combines an early fusion strategy based on progressive interaction with a middle fusion strategy based on the temporal-modal difference, for high-performance RGBT VOD. In particular, we take each modality in turn as the master modality and achieve early fusion with the other modality as auxiliary information through progressive interaction. Such a design not only alleviates the modality gap but also facilitates middle fusion. The temporal-modal difference models temporal information through spatial offsets and employs feature erasure between modalities to encourage the network to focus on objects shared by both modalities. This hybrid fusion achieves high detection accuracy using only three input frames, which allows PTMNet to run at a high inference speed. Experimental results show that our approach achieves state-of-the-art performance on the VT-VOD50 dataset while operating at over 70 FPS. The code will be freely released at https://github.com/tzz-ahu for academic purposes.
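
The abstract does not specify implementation details; the following is a minimal illustrative sketch of the two ideas it names, written in PyTorch. All module names, layer sizes, and the saliency threshold are hypothetical assumptions for illustration, not the authors' actual PTMNet design.

```python
import torch
import torch.nn as nn

class ProgressiveEarlyFusion(nn.Module):
    """Sketch of master/auxiliary early fusion: each modality acts as the
    master stream and absorbs the other modality as auxiliary information
    before the backbone. Layer sizes are illustrative assumptions."""
    def __init__(self, channels=3):
        super().__init__()
        self.mix_rgb = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.mix_thermal = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, rgb, thermal):
        # RGB as master with thermal as auxiliary, and vice versa;
        # residual form keeps each master stream dominant.
        rgb_fused = rgb + self.mix_rgb(torch.cat([rgb, thermal], dim=1))
        thermal_fused = thermal + self.mix_thermal(torch.cat([thermal, rgb], dim=1))
        return rgb_fused, thermal_fused

def cross_modal_erasure(feat_a, feat_b, threshold=0.5):
    """Hypothetical feature erasure between modalities: suppress regions that
    are salient in only one modality, so the network attends to objects
    shared by both. The channel-mean saliency and threshold are assumptions."""
    sal_a = feat_a.mean(dim=1, keepdim=True).sigmoid()
    sal_b = feat_b.mean(dim=1, keepdim=True).sigmoid()
    shared = ((sal_a > threshold) & (sal_b > threshold)).float()
    return feat_a * shared, feat_b * shared
```

This sketch only conveys the high-level mechanism described in the abstract (progressive master/auxiliary interaction, then erasing modality-exclusive responses); the released code at the repository above is the authoritative reference.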