Automated safety management in construction can reduce injuries by identifying hazardous postures, actions, and missing personal protective equipment (PPE). However, existing computer vision (CV) methods struggle to connect recognition results to text-based safety rules. To address this issue, this paper presents a multi-modal framework that bridges the gap between construction image monitoring and safety knowledge. The framework comprises an image processing module that applies CV and dense image captioning techniques, and a text processing module that employs natural language processing (NLP) to evaluate semantic similarity between generated captions and safety rules. Experiments showed a mean average precision of 49.6% in dense captioning and an F1 score of 74.3% in hazard identification. While the proposed framework demonstrates a promising multi-modal approach toward automated safety hazard identification and reasoning, improvements in dataset size and model performance are still needed to enhance its effectiveness in real-world applications.
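To make the described pipeline concrete, the sketch below illustrates the kind of caption-to-rule matching the text processing module performs: a caption generated from a site image is embedded alongside text-based safety rules, and cosine similarity flags likely rule violations. This is a minimal illustration under assumptions, not the paper's implementation; the embedding model, example caption, rules, and the 0.6 threshold are all hypothetical.

```python
# Minimal sketch of semantic similarity matching between an image caption
# and text-based safety rules. NOT the paper's implementation: the model,
# example texts, and threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed general-purpose encoder

# A dense caption produced by the image processing module (hypothetical).
caption = "a worker standing on a scaffold without a safety harness"

# Text-based safety rules (hypothetical examples).
rules = [
    "workers on scaffolds must wear a safety harness",
    "hard hats are required in all construction areas",
    "ladders must be secured before climbing",
]

# Embed the caption and the rules, then score each pair with cosine similarity.
caption_emb = model.encode(caption, convert_to_tensor=True)
rule_embs = model.encode(rules, convert_to_tensor=True)
scores = util.cos_sim(caption_emb, rule_embs)[0]

# Flag rules whose similarity exceeds an assumed threshold as potential hazards.
THRESHOLD = 0.6
for rule, score in zip(rules, scores.tolist()):
    if score >= THRESHOLD:
        print(f"Potential hazard match ({score:.2f}): {rule}")
```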