Video inpainting is the task of synthesizing spatio-temporally coherent content in the missing regions of a given video sequence, and it has recently drawn increasing attention. To exploit temporal information across frames, most recent deep learning-based methods first align reference frames to the target frame with explicit or implicit motion estimation and then integrate the information from the aligned frames. However, their performance relies heavily on the accuracy of frame-to-frame alignment. To alleviate this problem, in this paper we propose a novel Temporal Group Fusion Network (TGF-Net) that effectively integrates temporal information through a two-stage fusion strategy. Specifically, the input frames are reorganized into different groups, and each group is followed by an intra-group fusion module that integrates the information within the group. Different groups provide complementary information for the missing region. A temporal attention model is further designed to adaptively integrate the information across groups. This fusion strategy removes the dependence on alignment operations and greatly improves the visual quality and temporal consistency of the inpainted results. In addition, a coarse alignment model is introduced at the beginning of the network to handle videos with large motion. Extensive experiments on the DAVIS and YouTube-VOS datasets demonstrate the superiority of the proposed method in terms of PSNR/SSIM, visual quality, and temporal consistency.
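To make the two-stage fusion idea concrete, the following is a minimal PyTorch sketch of grouping frames, fusing within each group, and then attending across groups. It is an illustrative assumption of one possible realization, not the authors' implementation: the module names (IntraGroupFusion, TemporalGroupFusion), the use of 3D convolution for intra-group fusion, the group size, and the channel counts are all hypothetical choices made for this example.

```python
# Hypothetical sketch of temporal group fusion (not the paper's actual code).
import torch
import torch.nn as nn

class IntraGroupFusion(nn.Module):
    """Stage 1: fuse the frames inside one temporal group (assumed 3D-conv design)."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, group_feats):           # (B, C, T_g, H, W)
        fused = torch.relu(self.fuse(group_feats))
        return fused.mean(dim=2)              # collapse the temporal axis -> (B, C, H, W)

class TemporalGroupFusion(nn.Module):
    """Stage 2: adaptively weight the per-group features with temporal attention."""
    def __init__(self, channels, num_groups):
        super().__init__()
        self.intra = nn.ModuleList([IntraGroupFusion(channels) for _ in range(num_groups)])
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)   # per-group attention score map

    def forward(self, groups):                # list of (B, C, T_g, H, W) tensors, one per group
        group_feats = [m(g) for m, g in zip(self.intra, groups)]       # each (B, C, H, W)
        scores = torch.stack([self.attn(f) for f in group_feats], 1)   # (B, G, 1, H, W)
        weights = torch.softmax(scores, dim=1)                         # attention over groups
        feats = torch.stack(group_feats, dim=1)                        # (B, G, C, H, W)
        return (weights * feats).sum(dim=1)                            # fused feature (B, C, H, W)

# Usage: 8 input frames reorganized into 2 groups of 4, with 64-channel features at 64x64.
groups = [torch.randn(1, 64, 4, 64, 64) for _ in range(2)]
out = TemporalGroupFusion(channels=64, num_groups=2)(groups)
print(out.shape)   # torch.Size([1, 64, 64, 64])
```

Note that the fused feature is produced without any explicit frame-to-frame alignment: each group contributes a candidate reconstruction of the missing region, and the attention weights decide, per spatial location, which groups to trust.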