Although current deep learning models for bearing fault diagnosis achieve excellent accuracy, they are not guided by the physical mechanisms of real bearing failures and lack a physically grounded training paradigm, which leaves intelligent fault diagnosis models with low interpretability and reliability. This study proposes a sound-vibration physical-information fusion constraint-guided (PFCG) deep learning (DL) method that fuses weighted sound and vibration multi-physical information into a DL model, guiding it to learn more realistic physical laws of bearing failure. First, a 15-degree-of-freedom nonlinear dynamics model of the multi-stage bearing degradation mechanism with sound-vibration response is developed; it considers the evolution of bearing failure from the healthy state to different degradation stages and uses a particle filtering algorithm to dynamically calibrate hidden parameters. Second, a lightweight DL fault diagnosis model is designed, in which deep interaction between the physical model and the DL model is realized through a weighted fusion of the cross-entropy loss, a physical consistency loss, and an uncertainty loss. Third, the superior diagnostic performance of the proposed sound-vibration PFCG-DL model is verified by comparing the performance fluctuations and parameter attributes of different DL benchmark models before and after PFCG guidance. Finally, the proposed PFCG-Transformer model achieves a diagnostic accuracy of 99.45% with only 0.62 M parameters; compared with the CAME-Transformer model (3.24 M parameters, 95.00% diagnostic accuracy), it significantly improves accuracy and reduces computational complexity by 81.5%.
In addition, the test time of PFCG-Transformer is reduced to 1.02 s, 60.2% less than that of CAME-Transformer, demonstrating higher computational efficiency and real-time performance. In terms of interpretability, the engineering interpretability and credibility of the models are further improved by visualizing the features learned by the vibration-modality and multimodal fusion models and by analyzing the sensitivity of the sound-vibration response models to internal and external physical hyperparameters. This study therefore provides a physical-information-guided deep learning method with strong interpretability and superior performance, offering an important reference for further research and application in bearing fault diagnosis.
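The weighted fusion of the cross-entropy loss, physical consistency loss, and uncertainty loss can be illustrated with a minimal sketch. The abstract does not define the individual terms, so everything below is an assumption made for illustration: the physical consistency term is taken as the mean squared error between a network-predicted response and the response simulated by the dynamics model, the uncertainty term is taken as the entropy of the predicted class distribution, and the weights `w_ce`, `w_phys`, and `w_unc` are hypothetical values, not those used in the study.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy for one sample, computed via a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def physical_consistency(predicted, simulated):
    """Assumed form: MSE between the predicted and physics-simulated response."""
    return sum((p - s) ** 2 for p, s in zip(predicted, simulated)) / len(predicted)


def predictive_entropy(logits):
    """Assumed form of the uncertainty term: entropy of the class distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def fused_loss(logits, label, predicted, simulated,
               w_ce=1.0, w_phys=0.5, w_unc=0.1):
    """Weighted sum of the three loss terms (weights are hypothetical)."""
    return (w_ce * cross_entropy(logits, label)
            + w_phys * physical_consistency(predicted, simulated)
            + w_unc * predictive_entropy(logits))


if __name__ == "__main__":
    # Four fault classes, true class 2; two-point toy response signals.
    loss = fused_loss([0.2, -1.0, 3.1, 0.4], 2, [0.9, 1.8], [1.0, 2.0])
    print(f"fused loss: {loss:.4f}")
```

In this sketch, raising `w_phys` penalizes predictions that disagree with the simulated sound-vibration response more heavily, which is one way a physical model could constrain a classifier during training.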