6D object pose estimation is an essential task in vision-based robotic grasping and manipulation. Prior works estimate an object's 6D pose by regressing from a single RGB-D frame without accounting for occluded objects, limiting their performance in human-robot collaboration scenarios with heavy occlusion. In this paper, we present an end-to-end model named \textit{TemporalFusion}, which integrates temporal motion information from RGB-D images for 6D object pose estimation. The core of the proposed model is to embed and fuse temporal motion information from multi-frame RGB-D sequences, which enables it to handle heavy occlusion in human-robot collaboration tasks. Furthermore, the proposed deep model also produces stable pose sequences, which is essential for real-time robotic grasping tasks. We evaluated the proposed method on the YCB-Video dataset, and experimental results show that our model outperforms state-of-the-art approaches. Our code is available at https://github.com/mufengjun260/H-MPose.