Multi-modality learning, exemplified by the language-image pre-trained CLIP model, has demonstrated remarkable performance in enhancing zero-shot capabilities and has recently gained significant attention. However, directly applying the language-image pre-trained CLIP to medical image analysis encounters a substantial domain shift, resulting in severe performance degradation due to inherent disparities between natural (non-medical) and medical image characteristics. To address this challenge and uphold, or even enhance, CLIP's zero-shot capability in medical image analysis, we develop a novel approach, Core-Periphery feature alignment for CLIP (CP-CLIP), to jointly model medical images and their corresponding clinical text. To this end, we design an auxiliary neural network whose structure is organized according to the core-periphery (CP) principle. This auxiliary CP network not only aligns medical image and text features into a unified latent space more efficiently but also ensures that the alignment is driven by principles of brain network organization. In this way, our approach effectively mitigates the domain-shift-induced degradation and further enhances CLIP's zero-shot performance in medical image analysis. More importantly, the proposed CP-CLIP exhibits excellent explanatory capability, enabling automatic identification of critical disease-related regions in clinical analysis. Extensive experiments and evaluations across five public datasets covering different diseases underscore the superiority of CP-CLIP in zero-shot medical image prediction and critical feature detection, demonstrating its promising utility for multimodal feature alignment in medical applications.
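The abstract describes, at a high level, an auxiliary core-periphery network that projects CLIP image and text features into a unified latent space. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' implementation: all names (`CPAlignmentHead`, `make_cp_mask`, `n_core`, dimensions, and the contrastive training step) are assumptions introduced here to show how a core-periphery connectivity constraint (core units connect densely, periphery units connect mainly through the core) could shape an alignment head on top of frozen CLIP embeddings.

```python
# Hypothetical sketch of a core-periphery (CP) masked alignment head;
# details are illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_cp_mask(n_units: int, n_core: int) -> torch.Tensor:
    """Binary connectivity mask: core-core and core-periphery links allowed;
    periphery-periphery links are pruned (except self-connections)."""
    mask = torch.zeros(n_units, n_units)
    mask[:n_core, :] = 1.0          # core rows connect to all units
    mask[:, :n_core] = 1.0          # all units connect to core columns
    mask.fill_diagonal_(1.0)        # keep self-connections for periphery units
    return mask


class CPAlignmentHead(nn.Module):
    """Auxiliary network projecting image or text features into a shared
    latent space through a CP-masked hidden layer (illustrative only)."""

    def __init__(self, in_dim: int = 512, hidden: int = 256,
                 out_dim: int = 128, n_core: int = 64):
        super().__init__()
        self.fc_in = nn.Linear(in_dim, hidden)
        self.hidden_weight = nn.Parameter(torch.empty(hidden, hidden))
        nn.init.xavier_uniform_(self.hidden_weight)
        self.register_buffer("cp_mask", make_cp_mask(hidden, n_core))
        self.fc_out = nn.Linear(hidden, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.relu(self.fc_in(x))
        # CP constraint: the mask zeroes periphery-periphery weights
        h = F.relu(h @ (self.hidden_weight * self.cp_mask))
        return F.normalize(self.fc_out(h), dim=-1)


# Usage sketch: align (placeholder) frozen CLIP image/text embeddings
# with a CLIP-style symmetric contrastive loss.
image_head, text_head = CPAlignmentHead(), CPAlignmentHead()
img_emb = torch.randn(8, 512)   # stand-in for frozen CLIP image features
txt_emb = torch.randn(8, 512)   # stand-in for frozen CLIP text features
logits = image_head(img_emb) @ text_head(txt_emb).t() / 0.07
targets = torch.arange(8)
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.t(), targets)) / 2
```

The masked hidden layer is one simple way to realize a core-periphery structure; the paper's actual CP network, training objective, and how it yields disease-region explanations are specified in the main text rather than here.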