Delamination in concrete structures typically arises from the expansion of corroding steel reinforcement and from separation of the bonding layer. These defects reduce overall structural integrity, safety, and service life, yet estimating the depth of a delamination remains challenging. To address this issue, this paper introduces a predictive approach that combines time-frequency features extracted from vibration signals with ensemble learning models. The contributions are twofold: (1) The time-frequency spectrograms of the signals are transformed into image features using the Histogram of Oriented Gradients (HOG) method, and these features serve as the model inputs. Ensemble learning models then provide rapid and accurate depth estimation from delamination signals, especially for shallow defects. (2) By analyzing the time-frequency spectrograms of the vibration signals, a double exponential model is established between the depth-to-thickness ratio of the delamination and the pixel-matrix Euclidean distance. This may provide a new perspective for estimating the depth of delamination defects whose dominant frequencies are difficult to identify because of geometry and boundary effects. Laboratory experiments were conducted on a concrete slab with simulated delamination defects at different depths to confirm the effectiveness of the proposed method.
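As a rough illustration of the second contribution, the sketch below computes a Euclidean distance between two pixel matrices and evaluates a generic double exponential form a·exp(b·x) + c·exp(d·x). It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the toy 2×2 matrices, and the coefficients are hypothetical, and it assumes that the distance is taken between a reference spectrogram and a measured one and that the distance is the model input with the depth-to-thickness ratio as the output.

```python
import math


def euclidean_distance(img_a, img_b):
    """Euclidean distance between two equal-shaped pixel matrices (lists of rows)."""
    same_shape = len(img_a) == len(img_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(img_a, img_b)
    )
    if not same_shape:
        raise ValueError("pixel matrices must have the same shape")
    squared_sum = sum(
        (pa - pb) ** 2
        for row_a, row_b in zip(img_a, img_b)
        for pa, pb in zip(row_a, row_b)
    )
    return math.sqrt(squared_sum)


def double_exponential(x, a, b, c, d):
    """Generic double exponential form: a*exp(b*x) + c*exp(d*x)."""
    return a * math.exp(b * x) + c * math.exp(d * x)


# Toy pixel matrices with made-up values (assumption: a reference
# spectrogram compared against a measured one).
reference = [[0, 0], [0, 0]]
measured = [[3, 0], [0, 4]]
distance = euclidean_distance(reference, measured)  # sqrt(9 + 16) = 5.0

# Hypothetical coefficients for illustration only; real values would have
# to be fitted to experimental data.
ratio_estimate = double_exponential(distance, a=0.6, b=-1.0, c=0.4, d=-0.1)
print(distance, ratio_estimate)
```

In practice the four coefficients would be obtained by fitting the model to measured pairs of distance and depth-to-thickness ratio, for example with a nonlinear least-squares routine.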