In the realm of automated apple picking operations, the real-time monitoring of apple maturity and diameter characteristics is of paramount importance. Given the constraints associated with feature detection of apples in automated harvesting, this study proposes a machine vision-based methodology for the accurate identification of Fuji apples’ maturity and diameter. Firstly, maturity level detection employed an improved YOLOv5s object detection model. The feature fusion section of the YOLOv5s network was optimized by introducing the cross-level partial network module VoVGSCSP and lightweight convolution GSConv. This optimization aimed to improve the model’s multiscale feature information fusion ability while accelerating inference speed and reducing parameter count. Within the enhanced feature fusion network, a dual attention mechanism combining channel and spatial attention (GAM) was introduced to refine the color and texture feature information of apples and to increase spatial position feature weights. In terms of diameter determination, the contours of apples are obtained by integrating the dual features of color and depth images within the target boxes acquired using the maturity detection model. Subsequently, the actual area of the apple contour is determined by calculating the conversion relationship between pixel area and real area at the current depth value, thereby obtaining the diameter of the apples. Experimental results showed that the improved YOLOv5s model achieved an average maturity level detection precision of 98.7%. Particularly noteworthy was the detection accuracy for low maturity apples, reaching 97.4%, surpassing Faster R-CNN, Mask R-CNN, YOLOv7, and YOLOv5s models by 6.6%, 5.5%, 10.1%, and 11.0% with a real-time detection frame rate of 155 FPS. Diameter detection achieved a success rate of 93.3% with a real-time detection frame rate of 56 FPS and an average diameter deviation of 0.878 mm for 10 apple targets across three trials. Finally, the proposed method achieved an average precision of 98.7% for online detection of apple maturity level and 93.3% for fruit diameter features. The overall real-time inference speed was approximately 56 frames per second. These findings indicated that the method met the requirements of real-time mechanical harvesting operations, offering practical importance for the advancement of the apple industry.