Flame detection is a key module of fire-fighting robots, especially for autonomous fire suppression. To tackle fire-fighting tasks effectively, fire-fighting robots are usually equipped with multimodal vision systems. On the one hand, cameras of different modalities provide complementary visual information. On the other hand, differences in installation position and resolution between cameras result in weakly aligned image pairs; that is, the positions of the same object are inconsistent across the modal images. Directly fusing image features from different modalities therefore struggles to meet the accuracy and false-alarm requirements of fire-fighting robots. To address this, we propose a multimodal flame detection model based on projection and attention guidance. First, we use projection to obtain the approximate position of the flame in the thermal image and employ a neighbor sampling module to detect flames around it. Second, we design an attention guidance module based on index matching, which applies the attention map generated by the thermal modality to refine the regional features of the color modality. Experiments on multimodal datasets collected by an actual fire-fighting robot validate that the proposed method is effective in both fire and non-fire environments.
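
To make the projection and neighbor sampling step concrete, the following is a minimal sketch, assuming the cross-modal alignment can be approximated by a pre-calibrated 3x3 homography `H` from color-image to thermal-image coordinates; `H`, the window size, the stride, and the search radius are illustrative assumptions rather than the paper's exact formulation.

```python
# Sketch: project a flame hypothesis from the color image into the thermal
# image, then enumerate candidate windows around the projected center
# (a neighbor-sampling search). All parameters here are assumptions.
import numpy as np

def project_point(H: np.ndarray, xy: tuple[float, float]) -> tuple[float, float]:
    """Map an (x, y) pixel from the color image into the thermal image."""
    x, y = xy
    u, v, w = H @ np.array([x, y, 1.0])
    return u / w, v / w  # perspective divide

def neighbor_windows(center, win=64, stride=32, radius=1):
    """Yield candidate windows around the projected flame center;
    each window is later scored by the flame detector."""
    cx, cy = center
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x0 = int(cx + dx * stride - win / 2)
            y0 = int(cy + dy * stride - win / 2)
            yield (x0, y0, x0 + win, y0 + win)  # (left, top, right, bottom)

# Example: identity homography as a placeholder calibration, 3x3 grid
# of thermal windows around the projected flame center.
H = np.eye(3)
center_t = project_point(H, (420.0, 310.0))
candidates = list(neighbor_windows(center_t))
```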
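
The attention guidance module can be pictured along similar lines. Below is a hedged sketch in which an attention map from the thermal branch re-weights region features from the color branch at index-matched positions; the tensor shapes, the softmax normalization, the gather-by-index step, and the residual re-weighting are assumptions made for illustration, not the paper's definitive design.

```python
# Sketch: thermal attention guiding color regional features via index
# matching. Shapes and the residual form are illustrative assumptions.
import torch

def guide_color_features(color_feat: torch.Tensor,   # (N, C) color region features
                         thermal_attn: torch.Tensor, # (M,) thermal attention scores
                         match_idx: torch.Tensor     # (N,) thermal index matched to each color region
                         ) -> torch.Tensor:
    """Scale each color region feature by its matched thermal attention weight."""
    weights = torch.softmax(thermal_attn, dim=0)  # normalize thermal attention
    w = weights[match_idx].unsqueeze(-1)          # gather weights via index matching
    return color_feat * (1.0 + w)                 # residual re-weighting keeps the original signal

# Example: 5 color regions guided by a 7-cell thermal attention map.
refined = guide_color_features(torch.randn(5, 256),
                               torch.randn(7),
                               torch.tensor([0, 3, 3, 6, 2]))
```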