Advances in computer vision and deep learning have made generated Deepfake media increasingly difficult to distinguish from authentic content. Moreover, recent forgery techniques also manipulate the audio track to match the forged video, posing new challenges. However, owing to cross-modal bias, existing multimodal detection methods do not fully exploit intra-modal and cross-modal forgery clues, which limits detection performance. In this paper, we propose a novel audio-visual aware multimodal Deepfake detection framework that magnifies both intra-modal and cross-modal forgery clues. First, to capture temporal intra-modal defects, a Forgery Clues Magnification Transformer (FCMT) module is proposed to magnify forgery clues based on sequence-level relationships. Then, a Distribution Difference based Inconsistency Computing (DDIC) module, built on the Jensen–Shannon divergence, is designed to adaptively align multimodal information and further magnify cross-modal inconsistency. Next, we exploit spatial artifacts by concatenating multi-scale feature representations to provide comprehensive information. Finally, a feature fusion module adaptively fuses these features into a more discriminative representation. Experiments demonstrate that the proposed framework outperforms independently trained models and yields superior generalization to unseen types of Deepfakes.
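The cross-modal inconsistency measure at the heart of the DDIC module, the Jensen–Shannon divergence, can be sketched as follows. This is a minimal illustration only: the softmax normalization of features into distributions and the per-frame feature shapes are assumptions for the example, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Assumed normalization: turn raw feature vectors into probability distributions
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q); eps guards against log(0)
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q), axis=-1)

def js_divergence(p, q):
    # Jensen-Shannon divergence: symmetric and bounded above by log(2)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Illustrative per-frame audio and visual features (T frames, D dims are made up)
rng = np.random.default_rng(0)
audio_dist = softmax(rng.normal(size=(8, 16)))
visual_dist = softmax(rng.normal(size=(8, 16)))
inconsistency = js_divergence(audio_dist, visual_dist)  # one score per frame, shape (8,)
```

A larger divergence between the audio and visual distributions at a given frame would indicate stronger cross-modal inconsistency, which the framework magnifies as a forgery cue.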