This paper investigates the applicability of deep learning models for predicting the severity of forest wildfires, using a new benchmark dataset called EO4WildFires. EO4WildFires integrates multispectral imagery from Sentinel-2, SAR data from Sentinel-1, and meteorological data from NASA Power, annotated with EFFIS data for forest fire detection and size estimation. The dataset covers 45 countries and contains 31,730 wildfire events from 2018 to 2022. These data sources are archived into data cubes so that wildfire severity can be assessed from both current and historical forest conditions, using a broad range of variables including temperature, precipitation, and soil moisture. The experimental setup tests the effectiveness of different deep learning architectures in predicting the size and shape of wildfire-burned areas. The study covers both image segmentation networks and visual transformers, and applies the same experimental design to all models so that the results are comparable. The training data were adjusted, for example by excluding empty labels and very small events, to focus on more significant wildfire events and potentially improve prediction accuracy. Model performance was evaluated using the F1 score, the IoU score, and the Average Percentage Difference (aPD). Together, these metrics give a multi-faceted view of model performance, covering precision, sensitivity, and the accuracy of the burned area estimation. After extensive testing, the final model, a LinkNet with a ResNet-34 backbone, achieved the following results on the test set: 0.86 F1 score, 0.75 IoU, and 70% aPD. These results were obtained when all of the available samples were used. When the empty labels were excluded from both training and testing, the model's performance improved significantly, to 0.87 F1 score, 0.77 IoU, and 44.8% aPD.
This indicates that the number of samples, as well as their respective sizes (areas), tends to affect the model's robustness. This limitation is well known in the remote sensing domain, where accessible, accurately labeled data may be scarce. Visual transformers such as TeleViT showed potential but underperformed compared to segmentation networks in terms of F1 and IoU scores.
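
To make the three evaluation metrics concrete, the following is a minimal sketch of how they could be computed on binary burned-area masks. The function names and the nested-list mask representation are illustrative assumptions, not the paper's code. In particular, the abstract does not define aPD, so the sketch assumes it is the mean absolute difference between predicted and reference burned area, expressed as a percentage of the reference area, and it skips events with an empty reference mask because the ratio is undefined there.

```python
from typing import Sequence, Tuple

# Binary mask as nested sequences: 1 = burned pixel, 0 = unburned pixel.
Mask = Sequence[Sequence[int]]


def confusion_counts(pred: Mask, truth: Mask) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) in pixels."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    return tp, fp, fn


def f1_score(pred: Mask, truth: Mask) -> float:
    """Pixel-wise F1 (Dice) score: 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, truth)
    denom = 2 * tp + fp + fn
    # Assumption: two empty masks count as a perfect match.
    return 2 * tp / denom if denom else 1.0


def iou_score(pred: Mask, truth: Mask) -> float:
    """Intersection over union: TP / (TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, truth)
    denom = tp + fp + fn
    # Assumption: two empty masks count as a perfect match.
    return tp / denom if denom else 1.0


def average_percentage_difference(
    preds: Sequence[Mask], truths: Sequence[Mask]
) -> float:
    """Assumed aPD: mean of |predicted area - true area| / true area * 100.

    Events with an empty reference mask are skipped (assumption), since the
    relative difference is undefined when the true area is zero.
    """
    diffs = []
    for pred, truth in zip(preds, truths):
        true_area = sum(sum(row) for row in truth)
        if true_area == 0:
            continue
        pred_area = sum(sum(row) for row in pred)
        diffs.append(abs(pred_area - true_area) / true_area * 100.0)
    return sum(diffs) / len(diffs) if diffs else 0.0


# Toy example: the prediction covers the true burned area plus one extra pixel.
truth = [[1, 1, 0], [1, 1, 0]]
pred = [[1, 1, 1], [1, 1, 0]]
print(f1_score(pred, truth))
print(iou_score(pred, truth))
print(average_percentage_difference([pred], [truth]))
```

Under this assumed definition, lower aPD is better, and F1 and IoU measure overlap while aPD only compares total areas, so a prediction can score well on one and poorly on the other.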