Detecting the depth of surface cracks in metal structures is of great significance for ensuring equipment safety. However, commonly used detection methods are limited in sensitivity and imaging dimensionality, which compromises accuracy and hinders feature extraction. To this end, a laser acoustic emission crack depth detection method that combines modal decomposition and deep learning is proposed. The method is more sensitive to the elastic wave changes that indicate the depth of microscopic cracks and achieves higher accuracy than traditional detection methods. In addition, frequency-adaptive feature mode decomposition of multivariate signals and color multimodal feature fusion are used to address the difficulty of mining time–frequency domain features in current research. The crack depth detection network, CDDNet, is constructed using an attention-based fusion backbone, a novel cross-modal interaction fusion, and a hierarchical information fusion strategy. With transfer learning, CDDNet detects crack depth at the 0.05 mm level and reaches an accuracy of 93.6 %, outperforming the other models compared. This paper shows that the proposed transfer-learning-based laser acoustic emission technique and its supporting data processing methods can effectively establish a connection between the experimental and simulation datasets, addressing current challenges in crack depth detection with respect to micro-scale precision and data processing.