In situ automatic fruit monitoring is of great interest for more accurate and cost-efficient decision making in agriculture. For this purpose, the development of computer vision-based tools is essential. Deep Learning techniques have shown good performance in fruit detection and segmentation. Recently, new models based on the Transformer architecture have emerged with promising potential and zero-shot inference capability. In this paper, a Deep Learning model, Mask R-CNN, was trained for on-tree pomegranate fruit segmentation and compared with foundation models based on the Vision Transformer: Grounding DINO and the Segment Anything Model (SAM). Mask R-CNN achieved better performance, according to F1 score and AP metrics, at a lower computational cost, measured as prediction time. One of the most interesting applications derived from fruit segmentation is fruit size estimation. However, segmented fruit masks are frequently incomplete due to occlusions, so estimating fruit size from images is not a straightforward process. In this work, we also propose a novel algorithm to estimate and monitor fruit size in pixel units from the automatically generated masks. A median relative error of 1.39% was obtained, demonstrating the potential and feasibility of future fully automatic fruit size estimators.
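The proposed size-estimation algorithm is not detailed in this abstract. Purely as an illustration of the underlying problem (recovering a fruit diameter from an occluded, incomplete mask), the sketch below uses a least-squares (Kåsa) circle fit to the visible boundary points of a roughly spherical fruit; the function name and the simulated arc are hypothetical and are not the authors' method.

```python
import numpy as np

def fit_circle(points):
    """Least-squares (Kasa) circle fit to 2-D boundary points.

    Because the fit uses whatever arc of the outline is visible,
    it can still recover a radius when the mask is partially occluded.
    Returns (cx, cy, r) in pixel units.
    """
    x, y = points[:, 0], points[:, 1]
    # Linear system for x^2 + y^2 = a*x + b*y + c
    A = np.column_stack([x, y, np.ones_like(x)])
    b = x**2 + y**2
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    cx, cy = sol[0] / 2.0, sol[1] / 2.0
    r = np.sqrt(sol[2] + cx**2 + cy**2)
    return cx, cy, r

# Simulated visible boundary: only half the fruit outline (occlusion).
theta = np.linspace(0.0, np.pi, 80)
arc = np.column_stack([50 + 20 * np.cos(theta), 60 + 20 * np.sin(theta)])
cx, cy, r = fit_circle(arc)
diameter_px = 2.0 * r  # size estimate in pixel units despite the missing half
```

In practice, the boundary points would come from the predicted segmentation mask (e.g. its contour), and the pixel diameter would then be converted to metric units with a known scale reference.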