Multi-modality ship image retrieval aims to retrieve, given a query ship image, matching ship images from a large dataset that spans multiple imaging modalities. A key challenge is handling the intra-modality variations and cross-modality discrepancies caused by complex image content and differing imaging systems. Current multi-modality retrieval methods primarily extract heterogeneous features from the different modalities and map them into a shared feature space. However, these methods cannot determine which extracted features are actually meaningful for retrieval, because features from different modalities interact in highly complex ways. To overcome this limitation, we propose a novel framework, dubbed Disentangled Fusion Network (DFN), which decomposes ship images into attribute and identity features that capture modality-specific details and support ship identification, respectively. This is achieved by an image disentanglement and fusion module that separates features across modalities and converts the challenging multi-modality retrieval problem into a more tractable single-modality one. To enable fine-grained retrieval, a region mining module identifies discriminative areas within ship images. Training employs a bilevel optimization strategy that jointly learns parameters for image fusion and retrieval. Extensive experiments on a publicly available multi-modality ship image dataset demonstrate the superior performance of the proposed DFN on both image fusion and retrieval tasks.
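To make the high-level idea concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of the disentangle-and-fuse pipeline with an alternating, bilevel-style training loop: each image is encoded into an identity feature (modality-invariant, used for retrieval) and an attribute feature (modality-specific), the two are fused by a decoder, and fusion and retrieval objectives are optimized in turn. All module names, dimensions, and losses below are assumptions made purely for illustration.

```python
# Minimal sketch, assuming a simple encoder/decoder design; not the paper's actual DFN.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DisentangleEncoder(nn.Module):
    """Splits an input image into identity and attribute features (assumed design)."""

    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.identity_head = nn.Linear(64, feat_dim)   # modality-invariant, for retrieval
        self.attribute_head = nn.Linear(64, feat_dim)  # modality-specific details

    def forward(self, x):
        h = self.backbone(x)
        return self.identity_head(h), self.attribute_head(h)


class FusionDecoder(nn.Module):
    """Reconstructs a fused image from identity + attribute features (assumed design)."""

    def __init__(self, feat_dim: int = 128, image_size: int = 64):
        super().__init__()
        self.image_size = image_size
        self.decode = nn.Linear(2 * feat_dim, 3 * image_size * image_size)

    def forward(self, identity, attribute):
        out = torch.sigmoid(self.decode(torch.cat([identity, attribute], dim=1)))
        return out.view(-1, 3, self.image_size, self.image_size)


def retrieval_loss(query_id, gallery_id, labels):
    """Cosine-similarity cross-entropy over gallery identities (stand-in retrieval loss)."""
    sims = F.normalize(query_id, dim=1) @ F.normalize(gallery_id, dim=1).t()
    return F.cross_entropy(sims / 0.1, labels)


if __name__ == "__main__":
    encoder, decoder = DisentangleEncoder(), FusionDecoder()
    fusion_params = list(encoder.parameters()) + list(decoder.parameters())
    opt_fusion = torch.optim.Adam(fusion_params, lr=1e-4)
    opt_retrieval = torch.optim.Adam(encoder.parameters(), lr=1e-4)

    # Toy cross-modality pair, e.g. visible and infrared crops of the same ships.
    vis = torch.rand(8, 3, 64, 64)
    ir = torch.rand(8, 3, 64, 64)
    labels = torch.arange(8)  # each visible image matches the same-index infrared image

    for step in range(2):  # alternating (bilevel-style) updates, shown for two steps
        # Lower level: optimize fusion/reconstruction quality, mixing identity from
        # one modality with attributes from the other.
        id_v, attr_v = encoder(vis)
        id_i, attr_i = encoder(ir)
        fused = decoder(id_v, attr_i)
        loss_fusion = F.l1_loss(fused, vis)
        opt_fusion.zero_grad(); loss_fusion.backward(); opt_fusion.step()

        # Upper level: optimize retrieval using the identity features alone.
        id_v, _ = encoder(vis)
        id_i, _ = encoder(ir)
        loss_ret = retrieval_loss(id_v, id_i, labels)
        opt_retrieval.zero_grad(); loss_ret.backward(); opt_retrieval.step()
```

The alternating structure mirrors the idea that retrieval is performed on identity features only, so the multi-modality matching problem reduces to a single-modality comparison once modality-specific attributes are factored out.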