Multi-channel speech enhancement aims to extract the desired speech using a microphone array and has many potential applications, such as video conferencing, automatic speech recognition, and hearing aids. Recently, deep learning-based spatial filters have achieved remarkable improvements over traditional beamformers, yet the desired speech is often inferred directly from the noisy features without explicitly modeling the interference. In this work, a novel two-stage framework is proposed to extract the desired speech under the guidance of both the estimated interference and the estimated desired signal. The resulting framework, called the Separation and Interaction Network (SI-Net), comprises two components: the first module coarsely separates speech and interference, and the second sub-network serves as a post-processing module that simultaneously suppresses residual noise and regenerates missing speech components under the guidance of the previously estimated speech and interference characteristics. Because both modules are differentiable, the proposed framework can be trained in an end-to-end manner. In addition, a causal spatial-temporal attention module is designed to effectively model inter-channel and inter-frame correlations simultaneously. Moreover, under this framework, we adopt channel shuffle and gated fusion strategies for the interaction between the speech and interference components, conveying knowledge about both “where to suppress and where to enhance”. Experiments conducted on a simulated multi-channel speech dataset demonstrate the superiority of the proposed framework over state-of-the-art baselines while still supporting real-time processing.
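
To make the speech-interference interaction concrete, the following is a minimal sketch of a gated fusion block in which interference features gate the speech features, illustrating the "where to suppress and where to enhance" idea. The module name `GatedFusion`, the 1x1 convolution, and the tensor shapes are illustrative assumptions for this sketch, not the exact SI-Net implementation.

```python
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Illustrative gated fusion between a speech branch and an interference branch.

    A gate computed from both branches modulates the speech features, so regions
    dominated by interference are attenuated and clean regions are preserved.
    This is a generic sketch of the interaction idea, not the paper's module.
    """

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolution over the concatenated features produces a per-element gate.
        self.gate_conv = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, speech_feat: torch.Tensor, interf_feat: torch.Tensor) -> torch.Tensor:
        # speech_feat, interf_feat: (batch, channels, frames, freq_bins)
        gate = torch.sigmoid(self.gate_conv(torch.cat([speech_feat, interf_feat], dim=1)))
        # Element-wise gating of the speech features by the learned mask.
        return speech_feat * gate


if __name__ == "__main__":
    fusion = GatedFusion(channels=16)
    speech = torch.randn(1, 16, 100, 64)        # (batch, channels, frames, freq_bins)
    interference = torch.randn(1, 16, 100, 64)
    refined = fusion(speech, interference)
    print(refined.shape)  # torch.Size([1, 16, 100, 64])
```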