Firmness is a critical indicator for predicting fruit ripeness, optimal harvest date, and shelf life. In this study, a novel prototype device for real-time acoustic detection of fruit and a conventional visible/near-infrared (Vis/NIR) spectroscopy real-time detection device were used to collect acoustic and spectral signals from yellow-fleshed peaches and jointly predict their firmness. The acoustic and optical signals were converted into one- and two-dimensional feature data using complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN), continuous wavelet transform (CWT), and Gramian angular field (GAF) processing. Based on these data, several firmness prediction models for yellow-fleshed peaches were constructed, including partial least squares (PLS), support vector regression (SVR), Swin Transformer (SwinT), and the hybrid SwinT-PLS and SwinT-SVR models. The experimental results showed that the SwinT-PLS model based on the fusion of acoustic image features and Vis/NIR spectral features, each selected by competitive adaptive reweighted sampling (CARS), achieved the best prediction performance (R²P = 0.951, RMSEP = 0.443 N/mm, RPDP = 4.339), significantly better than that of the models based on acoustic or Vis/NIR spectral data alone. The proposed method predicts fruit firmness rapidly, non-destructively, and accurately, and has excellent prospects for commercial real-time fruit sorting applications.
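
The prediction-set metrics reported above can be illustrated with a minimal Python sketch. It assumes the definitions commonly used in chemometrics: the coefficient of determination, the root mean square error of prediction (RMSEP), and the residual predictive deviation (RPD), taken here as the sample standard deviation of the reference values divided by RMSEP. The function name `prediction_metrics` and the toy firmness values are hypothetical and are not taken from the study, whose exact formulas may differ in detail (for example, in bias correction).

```python
import math
from statistics import mean, stdev


def prediction_metrics(y_true, y_pred):
    """Return (R2, RMSEP, RPD) for measured and predicted firmness values.

    Assumed definitions (not confirmed by the source):
      R2    = 1 - SS_res / SS_tot
      RMSEP = sqrt(SS_res / n)
      RPD   = sample standard deviation of y_true / RMSEP
    """
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")

    n = len(y_true)
    # Residual sum of squares between measured and predicted values
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares around the mean of the measured values
    y_bar = mean(y_true)
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)

    r2 = 1.0 - ss_res / ss_tot
    rmsep = math.sqrt(ss_res / n)
    rpd = stdev(y_true) / rmsep
    return r2, rmsep, rpd


# Toy firmness values in N/mm, purely illustrative
measured = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
r2, rmsep, rpd = prediction_metrics(measured, predicted)
print(f"R2 = {r2:.3f}, RMSEP = {rmsep:.3f} N/mm, RPD = {rpd:.3f}")
```

Under these definitions, RPD and R² are closely linked (for large samples, R² is approximately 1 − 1/RPD²), which is why a high RPD accompanies a high R² in the reported results.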