Abstract

Study question
Can a deep learning model accurately characterize ovarian features on pelvic ultrasounds?

Summary answer
We developed a deep learning model, Ovarify, that can accurately classify and segment ovarian features on pelvic ultrasounds.

What is known already
Ovarian morphology, as detected using pelvic ultrasonography, is widely used in the assessment of gynecological health. Morphological metrics such as ovarian volume and antral follicle count have known diagnostic and prognostic value for a variety of gynecological outcomes, ranging from risk of ovulatory disorders to likelihood of response to assisted reproductive technologies. However, a lack of standardization in the interpretation of ovarian features and the time-intensive nature of validated image analysis techniques (such as 2D grid analysis) limit their utility in clinical practice. Deep learning provides an opportunity to overcome these limitations.

Study design, size, duration
The training dataset for the model comprised 426 3D pelvic ultrasounds obtained using the GE Voluson ultrasound platform, yielding 56,538 individual frames with potential ovarian content. In addition, a similarly collected validation dataset consisted of 95 ultrasounds, yielding 11,304 frames. The held-out test set comprised 107 ultrasounds with 14,767 frames. All frames were manually labeled with the associated 2D ovarian contour, and 163 of the ultrasounds had an associated ovarian volume label.

Participants/materials, setting, methods
Ovarify consists of two stages: 1) a lightweight classifier that labels ultrasound frames as either positive or negative (having or not having ovarian content) and 2) a subsequent U-Net-inspired segmentation module with an Xception backbone and attention-based decoder that segments the ovary and antral follicles in all frames previously classified as positive.
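The two-stage design described above can be illustrated with a minimal Python sketch. It only shows the gating logic (segment a frame only if the classifier marks it positive). The function names, the `threshold` value, and the toy classifier and segmenter are hypothetical stand-ins introduced for illustration; they are not Ovarify's actual networks or settings.

```python
from typing import Callable, List, Optional, Sequence

Frame = List[List[float]]  # one 2D ultrasound frame as a grid of intensities
Mask = List[List[int]]     # per-pixel label map with the same shape as a frame


def run_two_stage(
    frames: Sequence[Frame],
    classify: Callable[[Frame], float],
    segment: Callable[[Frame], Mask],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the abstract
) -> List[Optional[Mask]]:
    """Segment only the frames that the classifier marks as positive."""
    results: List[Optional[Mask]] = []
    for frame in frames:
        score = classify(frame)  # stage 1: estimated probability of ovarian content
        if score >= threshold:
            results.append(segment(frame))  # stage 2: runs on positive frames only
        else:
            results.append(None)  # negative frames are skipped
    return results


def toy_classifier(frame: Frame) -> float:
    """Stand-in for the classifier: mean pixel intensity used as a score."""
    pixels = [value for row in frame for value in row]
    return sum(pixels) / len(pixels)


def toy_segmenter(frame: Frame) -> Mask:
    """Stand-in for the segmentation module: mark pixels brighter than 0.5."""
    return [[1 if value > 0.5 else 0 for value in row] for row in frame]


# Made-up frames for illustration only.
frames = [
    [[0.0, 0.1], [0.0, 0.2]],  # low mean intensity, treated as negative
    [[0.9, 0.8], [0.2, 0.7]],  # high mean intensity, treated as positive
]
masks = run_two_stage(frames, toy_classifier, toy_segmenter)
```

Gating the segmentation module behind a lightweight classifier means the heavier model runs only on frames likely to contain the ovary, which is one plausible reason for such a design.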
Classification performance was evaluated using F1-score and area under the receiver operating characteristic curve (AUROC). Segmentation performance was evaluated using the Jaccard index.

Main results and the role of chance
For the classification stage, Ovarify achieved an AUROC of 0.980±0.002 and 0.976±0.004 on the training and validation sets, respectively. On the held-out test set, the model achieved an AUROC of 0.971±0.005. In addition, the model achieved an F1-score of 0.825±0.006, 0.810±0.013, and 0.784±0.014 on the training, validation, and held-out test sets, respectively. The high AUROC and F1-scores across both seen and unseen datasets indicate a high-performing classifier. In the segmentation stage, Ovarify reached a median Jaccard index of 0.815±0.004 on the unseen test set when segmenting the ovary, and a median Jaccard index of 0.657±0.018 on the unseen test set when segmenting antral follicles. High agreement between the ground-truth ovary and follicle labels and the predicted segmented regions increases confidence in the downstream morphological analysis. For example, the Pearson correlation coefficient between the Ovarify-calculated ovarian volume and the manually validated ovarian volume was 0.813, indicating that the automated ovarian volume calculation aligns closely with the ground truth. Ovarify was able to provide predictions in under one minute for the entirety of the held-out test set.

Limitations, reasons for caution
Ovarify was trained on images generated by the GE Voluson platform and has not been evaluated for generalizability across ultrasound images generated using machines whose output does not match the DICOM format standard.

Wider implications of the findings
The model developed is relevant for automatically calculating ovarian morphological features from pelvic ultrasounds.
This approach has implications for removing user-specific bias and error when calculating biomarkers associated with female reproductive health, including but not limited to ovulatory disorders such as polycystic ovary syndrome (PCOS).

Trial registration number
Not applicable
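For readers unfamiliar with the evaluation metrics named in this abstract, the following Python sketch shows the standard definitions of the Jaccard index, the F1-score, and the Pearson correlation coefficient. It is an illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical and all input values are made-up toy data.

```python
import math
from typing import Sequence


def jaccard_index(pred: Sequence[Sequence[int]], truth: Sequence[Sequence[int]]) -> float:
    """Intersection over union of two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as perfect agreement.
    return intersection / union if union else 1.0


def f1_score(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for binary frame labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


# Toy values for illustration only; none of these come from the study.
predicted_mask = [[1, 1, 0], [0, 1, 0]]
labeled_mask = [[1, 0, 0], [0, 1, 1]]
iou = jaccard_index(predicted_mask, labeled_mask)

predicted_labels = [1, 1, 0, 0, 1]
true_labels = [1, 0, 0, 1, 1]
f1 = f1_score(predicted_labels, true_labels)

model_volumes = [1.0, 2.0, 3.0, 4.0]
manual_volumes = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(model_volumes, manual_volumes)
```

In the toy example the two masks share 2 of 4 labeled pixels, so the Jaccard index is 0.5; a value such as the reported 0.815 corresponds to considerably greater overlap.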