Abstract

Study question
Can deep learning accurately evaluate embryo grades and predict clinical pregnancy while providing relevant clinical evidence, rather than only black-box results?

Summary answer
A deep ensemble method can improve predictive performance for embryo grades and clinical pregnancy while providing clinically relevant evidence.

What is known already
Previous studies have shown that AI can predict IVF outcomes by analyzing images of embryos. In many published studies, AI outperformed humans because it could identify features that human eyes could not easily detect. However, clinicians have been cautious about adopting AI technology because of the black-box nature of AI algorithms. In this study, we increased the predictive power of AI and provided evidence for its predictions by using deep ensembles and Grad-CAM images.

Study design, size, duration
We performed a retrospective study of single static images of 727 Day 5 blastocysts from 270 patients who underwent single embryo transfer at a single in vitro fertilization (IVF) clinic between January 2015 and March 2021. The images were collected from standard optical light microscopes and matched with metadata such as embryo grades and pregnancy outcomes.

Participants/materials, setting, methods
Two models were designed: an automatic embryo grading model and a pregnancy prediction model. Embryologists labeled a Day 5 embryo as a good embryo ("GEM") if it was graded 4AA/AB or above in the Gardner system. Pregnancy was defined as the presence of a fetal heartbeat (FHB). Deep ensembles were applied by training four convolutional neural networks (CNNs). Grad-CAM images were extracted from the last layer and reviewed by experts.

Main results and the role of chance
Among the single CNNs, the highest AUROCs of the embryo grading model and the pregnancy prediction model were 0.80 and 0.67, respectively. After applying deep ensembles, the AUROCs of the two models increased to 0.84 and 0.72, respectively. When the F1-score for the positive cases was maximized by adjusting the threshold of the ensembles, the accuracy, sensitivity and specificity of the embryo grading model were 88.1%, 92.9% and 62.5%, respectively. For the pregnancy prediction model, the accuracy, sensitivity and specificity were 66.3%, 77.1% and 55.6%, respectively. The accuracy of the GEM label in predicting pregnancy was 47.3% for the embryologists and 59.2% for the embryo grading AI model. Notably, the AI pregnancy prediction model outperformed the embryologists, while the grading model successfully auto-graded embryos; this is strong evidence that the AI considered more features for prediction than were used for grading. Review of the Grad-CAM images also showed that both AI models focused on the inner cell mass (ICM), trophectoderm (TE) and hatching. Although their areas of focus were the same, the pregnancy prediction model made better predictions than both the embryologists and the embryo grading model.

Limitations, reasons for caution
This is a retrospective study performed on embryo images from a single IVF center. In addition, including other variables, such as clinical data, may improve the models.

Wider implications of the findings
We showed that deep learning can automatically grade embryos and predict pregnancy more accurately than embryologists. Furthermore, the embryologists confirmed that the models were looking at key features such as the ICM, TE and hatching. Sharing such evidence with clinicians may be a necessary step toward the adoption of AI in clinical practice.

Trial registration number
Not applicable.
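The ensembling and threshold-adjustment steps described under the main results can be illustrated with a minimal sketch. The abstract does not state how the four CNN outputs were combined or how the threshold was searched, so the code below assumes simple averaging of each model's predicted positive-class probability and a search over the observed ensemble scores. All function names and the toy numbers are hypothetical and are not study data.

```python
import numpy as np

def ensemble_mean(prob_per_model):
    """Average positive-class probabilities across models (rows = models)."""
    return np.mean(np.asarray(prob_per_model, dtype=float), axis=0)

def confusion_counts(y_true, y_prob, threshold):
    """Return (tp, fp, fn, tn) when predicting positive for y_prob >= threshold."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_prob) >= threshold
    tp = int(np.sum(y_pred & y_true))
    fp = int(np.sum(y_pred & ~y_true))
    fn = int(np.sum(~y_pred & y_true))
    tn = int(np.sum(~y_pred & ~y_true))
    return tp, fp, fn, tn

def best_f1_threshold(y_true, y_prob):
    """Try every observed score as a threshold; keep the one with the highest F1."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(y_prob):
        tp, fp, fn, _ = confusion_counts(y_true, y_prob, t)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1

def summary_metrics(y_true, y_prob, threshold):
    """Accuracy, sensitivity and specificity at a fixed threshold."""
    tp, fp, fn, tn = confusion_counts(y_true, y_prob, threshold)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }

# Toy labels (1 = positive case) and toy outputs of four hypothetical CNNs.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
model_probs = np.array([
    [0.9, 0.6, 0.4, 0.3, 0.5, 0.8, 0.2, 0.7],
    [0.8, 0.7, 0.5, 0.2, 0.6, 0.9, 0.1, 0.6],
    [0.7, 0.5, 0.6, 0.4, 0.5, 0.7, 0.3, 0.8],
    [0.8, 0.6, 0.5, 0.3, 0.6, 0.8, 0.2, 0.7],
])

ensemble_prob = ensemble_mean(model_probs)
threshold, best_f1 = best_f1_threshold(y_true, ensemble_prob)
metrics = summary_metrics(y_true, ensemble_prob, threshold)
print(threshold, best_f1, metrics)
```

Maximizing F1 for the positive class tends to favor sensitivity over specificity, which matches the pattern in the reported figures. In practice the threshold would be chosen on validation data and then applied unchanged to a held-out test set.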