Temporal fusion approaches are critical for 3D visual perception tasks in IoV (Internet of Vehicles), but they often rely on intermediate representations without fully utilizing position information from the previous frame's detection results, and thus cannot fully compensate for the lack of depth information in visual data. In this work, we propose a novel framework called OccTr (Occupancy Transformer) that combines two temporal cues, the intermediate representation and the back-end representation, via an occupancy map to enhance temporal fusion in the object detection task. OccTr leverages attention mechanisms to perform both intermediate and back-end temporal fusion by incorporating intermediate BEV (bird's-eye view) features and the back-end prediction results of the detector. Our two-stage framework consists of occupancy map generation and cross-attention feature fusion. In stage one, the prediction results are converted into occupancy grid map format to generate the back-end representation. In stage two, the high-resolution occupancy maps are fused with BEV features using cross-attention layers. This fused temporal cue provides a strong prior for the temporal detection process. Experimental results demonstrate the effectiveness of our method in improving detection performance, achieving an NDS (nuScenes Detection Score) of 37.35% on the nuScenes test set, 1.94 points higher than the baseline.
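To make the two stages concrete, below is a minimal NumPy sketch, not the paper's implementation: stage one rasterizes previous-frame detection centers into a binary occupancy grid, and stage two fuses occupancy-derived tokens into BEV tokens with single-head cross-attention (BEV features as queries, occupancy tokens as keys/values). All function names, grid sizes, and weight shapes here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Stage one (sketch): convert previous-frame box centers into an occupancy grid.
# `extent` is the assumed half-width of the BEV range in meters.
def boxes_to_occupancy(centers_xy, grid=(8, 8), extent=8.0):
    occ = np.zeros(grid)
    for x, y in centers_xy:
        i = int(np.clip((x + extent) / (2 * extent) * grid[0], 0, grid[0] - 1))
        j = int(np.clip((y + extent) / (2 * extent) * grid[1], 0, grid[1] - 1))
        occ[i, j] = 1.0  # mark the cell containing a detected object
    return occ

# Stage two (sketch): single-head cross-attention fusing occupancy tokens
# into BEV tokens, with a residual connection.
def cross_attention_fuse(bev_tokens, occ_tokens, Wq, Wk, Wv):
    Q = bev_tokens @ Wq                              # (N_bev, d) queries
    K = occ_tokens @ Wk                              # (N_occ, d) keys
    V = occ_tokens @ Wv                              # (N_occ, d) values
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (N_bev, N_occ)
    return bev_tokens + attn @ V                     # residual fusion

# Toy usage: 64 BEV cells, 64 occupancy cells, feature dim 16.
rng = np.random.default_rng(0)
d = 16
bev = rng.normal(size=(64, d))
occ_grid = boxes_to_occupancy([(0.0, 0.0), (7.9, -7.9)])
occ_feat = occ_grid.reshape(-1, 1) * rng.normal(size=(1, d))  # lift grid to tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
fused = cross_attention_fuse(bev, occ_feat, Wq, Wk, Wv)
```

In an actual detector the occupancy grid would be much higher resolution and the fusion would run inside a transformer decoder, but the query/key/value roles stay the same: detection-derived occupancy acts as the back-end temporal cue attended to by the intermediate BEV features.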