Abstract Accurate assessment of pulmonary edema severity in acute decompensated congestive heart failure (CHF) patients is vital for treatment decisions. Traditional methods face challenges due to the complexity of chest X-ray (CXR) and unstructured radiology reports. We proposed a method combining self-supervised learning and multimodal cross-attention to address these challenges. Dual-mechanic self-supervised pre-training enhances feature extraction using contrastive learning between text and image features, and generative learning between images. A bidirectional multi-modal cross-attention model integrates image and text information for fine-tuning, improving model performance. Four CXR datasets consisting of 519, 437 images were used for pre-training; 1200 randomly selected image-text pairs were used for fine-tuning and partitioned into train, validation, and test sets at 3: 1: 1. Ablation studies for pre-training and fine-tuning approaches indicated their practicality as evidenced by the optimal macro F1 score of 0.667 and optimal macro-AUC of 0.904. It also outperformed other state-of-the-art multi-modality methods. The novel approach could accurately assess pulmonary edema severity, offering crucial support for CHF patient management.