With the advancement of machine learning and big-data technologies, bird's-eye-view (BEV) based methods have recently achieved significant breakthroughs in multi-view 3D occupancy prediction. However, BEV-centric 3D occupancy prediction still struggles with limited feature representation and high annotation costs in complex open environments. To overcome these issues and further advance 3D occupancy tasks, this study introduces a novel framework termed TDOcc. Leveraging multi-camera imagery, TDOcc performs 3D semantic occupancy prediction by learning directly in the raw 3D space, thereby maximizing information retention. TDOcc offers two notable advantages: first, it uses dense occupancy labels, which not only enable robust dense occupancy inference but also allow comprehensive estimation of objects in the scene; second, it exploits historical feature information by aligning past and present features through temporal cues, strengthening the feature fusion module. In addition, to address the ill-posed nature of camera-based 3D occupancy prediction, we introduce an enhancement module that operates in the 3D feature space and is applied during training to strengthen the model's learning capacity. Extensive experiments on the widely used nuScenes dataset demonstrate the effectiveness of our approach: compared with the recent TPVFormer and OccFormer, it improves mean Intersection over Union (mIoU) by 2.0 and 0.8 points, respectively, and reaches performance comparable to state-of-the-art LiDAR-based methods.
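To make the temporal fusion idea concrete, the sketch below illustrates one common way such alignment can be realized: the previous frame's BEV features are warped into the current ego frame with a rigid transform derived from ego poses, then fused with the current features by concatenation and convolution. This is a minimal illustration under our own assumptions; the class name, tensor shapes, and the concatenation-based fusion operator are hypothetical and are not claimed to match TDOcc's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalBEVFusion(nn.Module):
    """Minimal sketch of temporal feature alignment and fusion (assumed design,
    not TDOcc's): warp past BEV features into the current frame, then fuse."""

    def __init__(self, channels: int):
        super().__init__()
        # Fuse current and warped past features back to `channels` dims.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, curr_feat: torch.Tensor, prev_feat: torch.Tensor,
                theta: torch.Tensor) -> torch.Tensor:
        # curr_feat, prev_feat: (B, C, H, W) BEV feature maps.
        # theta: (B, 2, 3) affine transform taking current-frame normalized
        #        grid coordinates to previous-frame coordinates, computed
        #        offline from the ego poses of the two frames (assumption).
        grid = F.affine_grid(theta, list(prev_feat.shape), align_corners=False)
        # Bilinearly sample the past features at the warped locations;
        # regions that leave the past BEV grid are zero-padded.
        warped = F.grid_sample(prev_feat, grid, mode="bilinear",
                               padding_mode="zeros", align_corners=False)
        return self.fuse(torch.cat([curr_feat, warped], dim=1))


# Illustrative usage with an identity ego-motion (hypothetical shapes):
fusion = TemporalBEVFusion(channels=128)
curr = torch.randn(2, 128, 200, 200)
prev = torch.randn(2, 128, 200, 200)
theta = torch.tensor([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]]).repeat(2, 1, 1)
out = fusion(curr, prev, theta)  # (2, 128, 200, 200)
```

Warping past features into the current ego frame before fusion, rather than fusing raw feature maps directly, compensates for ego motion between frames so that features describing the same spatial location are combined.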