Image inpainting requires filling corrupted regions with plausible content that fits seamlessly with the surroundings. Current CNN-dominated inpainting models still produce artifacts inconsistent with the surroundings when facing large corrupted regions. This is attributed to the locality and weight sharing of convolutions, which limit the ability of CNNs to explicitly learn correlations between corrupted contents and distant contexts. In contrast, Transformer-dominated models excel at modeling such long-range correlations, but they often degrade local details due to their lack of locality. An effective inpainting model should therefore combine the advantages of both. To this end, we propose an inpainting-oriented concurrent hybrid network, termed the bidirectional interaction dual-stream network (BIDS-Net), which integrates the complementary strengths of CNNs and Transformers. Specifically, we adopt a CNN stream to capture rich local patterns for detail reconstruction and refinement, and a Transformer stream to model long-range contextual correlations and borrow relevant information from a global field of view. More importantly, we devise a bidirectional feature interaction (BFI) module based on selective feature fusion, which enhances the locality of the Transformer stream and the long-range correlation awareness of the CNN stream so that the two streams fully complement each other. In addition, we build both streams on a hierarchical encoder–decoder structure to facilitate multi-scale context reasoning and improve computational efficiency. Comprehensive experiments on three public datasets demonstrate the superiority of the proposed BIDS-Net on the image inpainting task.
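
The abstract does not detail the internals of the BFI module, so the following is a minimal PyTorch sketch of one plausible realization, assuming sigmoid-gated selective feature fusion: each stream adds a learned, spatially varying fraction of the other stream's features to its own. All names here (`BFI`, `gate_cnn`, `gate_trans`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BFI(nn.Module):
    """Hypothetical bidirectional feature interaction module.

    Each stream receives a gated mixture of its own features and the
    other stream's features (a simple form of selective feature fusion).
    """

    def __init__(self, channels: int):
        super().__init__()
        # Each gate is predicted from the concatenation of both streams.
        self.gate_cnn = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.gate_trans = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_cnn: torch.Tensor, f_trans: torch.Tensor):
        # f_cnn, f_trans: (B, C, H, W) features from the two streams.
        joint = torch.cat([f_cnn, f_trans], dim=1)
        g_c = self.gate_cnn(joint)    # how much global context the CNN stream absorbs
        g_t = self.gate_trans(joint)  # how much local detail the Transformer stream absorbs
        # Inject long-range context into the CNN stream and locality
        # into the Transformer stream, as the abstract describes.
        return f_cnn + g_c * f_trans, f_trans + g_t * f_cnn

# Usage: fuse same-resolution features from the two streams.
bfi = BFI(channels=64)
out_cnn, out_trans = bfi(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```

In a hierarchical encoder–decoder, a module of this kind would typically be inserted at each scale where the two streams produce feature maps of matching resolution.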