As three-dimensional integrated circuit (3D-IC) chip technology advances, thermal management has become increasingly important because of the rising heat flux caused by stacking. Micro-pin fin-embedded cooling has emerged as a promising solution for 3D-ICs, offering better thermal and hydraulic performance than conventional microchannel heat sinks. It is also easy to integrate into existing 3D-IC structures, such as through-silicon vias between stacks. Using two-phase flow in micro-pin fins further improves temperature uniformity and raises the heat transfer coefficient by exploiting latent heat. Nevertheless, predicting thermal performance in micro-pin fin heat sinks under boiling conditions remains challenging owing to intricate geometric shapes and diverse operating conditions, and the current lack of correlations or theoretical models is a significant obstacle. To address this problem, our study employed a Multimodal machine-learning (ML) approach, which combines image data capturing the boiling patterns of two-phase flow with information about geometric shape and operating conditions, to predict heat transfer characteristics in micro-pin fin heat sinks. We used experimental data comprising 155 sets of boiling heat transfer data obtained with the dielectric fluid FC-72 in two micro-pin fin shapes etched directly on Si. Four ML algorithms (XGBoost, LightGBM, multilayer perceptron (MLP), and Multimodal ML) were employed to predict thermal performance. A correlation coefficient analysis performed before training revealed the influence of each type of measurement data on the heater surface temperature during two-phase flow. Prediction accuracy was measured using the mean absolute percentage error (MAPE), and the results for the maximum and average temperatures were compared according to the characteristics of each ML algorithm.
Overall, the Multimodal approach predicted temperature distributions with spatial detail more accurately than the conventional decision-tree algorithms and the MLP. When trained with boiling images, the Multimodal ML model achieved a MAPE of 1.81% for the maximum temperature and 0.84% for the average temperature of the heated surface. By contrast, the MLP model, which was not trained on boiling images, was less accurate, with a MAPE of 2.54% for the maximum temperature and 1.77% for the average temperature.
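
For readers unfamiliar with the metric, the following is a minimal sketch of how a MAPE for the maximum and average surface temperature could be computed from measured and predicted temperature maps. The function names (`mape`, `max_and_mean`) and all temperature values are hypothetical and are not taken from the study. Reducing each map to its maximum and mean before computing the error is an assumption about the evaluation procedure. Note also that MAPE depends on the temperature scale used (for example, °C versus K).

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned in percent."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((p - t) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


def max_and_mean(temp_map):
    """Reduce a 2D temperature map (list of rows) to its max and mean."""
    cells = [value for row in temp_map for value in row]
    return max(cells), sum(cells) / len(cells)


# Hypothetical 2x2 temperature maps for two test cases (illustrative only).
measured = [
    [[60.0, 62.0], [64.0, 66.0]],
    [[70.0, 72.0], [74.0, 80.0]],
]
predicted = [
    [[61.0, 62.0], [63.0, 66.0]],
    [[70.0, 73.0], [74.0, 78.0]],
]

# Summarize each map, then score the maxima and the means separately.
true_max, true_avg = zip(*(max_and_mean(m) for m in measured))
pred_max, pred_avg = zip(*(max_and_mean(m) for m in predicted))

mape_max = mape(true_max, pred_max)
mape_avg = mape(true_avg, pred_avg)
print(f"MAPE (max temperature): {mape_max:.2f}%")
print(f"MAPE (avg temperature): {mape_avg:.2f}%")
```

Reporting the two values separately mirrors the abstract's distinction between the maximum and the average temperature: the maximum is sensitive to local hot spots, whereas the average reflects the overall fit of the predicted map.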