O-285 Artificial intelligence algorithms reach expert-level accuracy in automated grading of blastocyst morphology assessment based on static embryo images and Gardner criteria

F Kromp, B Balaban, B Kovacic, I Martínez Rodero, L Parmegiani, M Fawzy, V Cottin, D Ljiljak, O Shebl, R Wagner, N Findikli, P Fancsovits, M Xie, T Ebner, I Cuevas Saiz

doi:10.1093/humrep/deac106.078

Abstract

Study question: Can artificial intelligence (AI) algorithms reach expert-level accuracy in blastocyst morphology assessment according to Gardner criteria?

Summary answer: The best-performing AI algorithm (Deit) exceeded the mean accuracy of human embryologists, measured against an embryologist majority vote, for all Gardner morphological criteria.

What is known already: Routinely, morphological grading of blastocysts is performed visually according to Gardner criteria, which suggest expansion (EXP), quality of the inner cell mass (ICM), and trophectoderm (TE) as key parameters to predict treatment outcome. Because the assessment is visual, blastocyst scoring is prone to inter- and intra-observer variability, which may lead to inconsistencies in selecting blastocysts for transfer. AI-based algorithms may help to improve treatment outcome predictability, as recent studies have suggested. In those studies, experts annotated parameters such as blastocyst quality or stage on static or time-lapse-derived blastocyst images in order to train AI algorithms, e.g. XCeption or YOLO, and compare them to human annotators.

Study design, size, duration: This retrospective study involves 2,270 images from 837 patients collected over a period of four years in a university IVF clinic.

Participants/materials, setting, methods: All images were annotated by one senior embryologist and divided into a training set and a balanced test set. Subsequently, eight embryologists labeled 300 test set images such that every image was seen by at least four embryologists. Annotators diverging from the ensemble vote by more than one standard deviation were excluded (n = 2) to set the ground truth labels. Finally, three AI architectures (XCeption, Swin, Deit) were trained and evaluated on that ground truth.
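
The consensus step described in the methods can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy grades, the tie-breaking rule (first label encountered wins), the use of the population standard deviation, and the choice to exclude only annotators whose accuracy lies more than one standard deviation below the mean are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev

def majority_vote(labels):
    """Return the most common label; ties go to the label seen first (assumption)."""
    return Counter(labels).most_common(1)[0][0]

def consensus(annotations, annotators):
    """Majority vote per image over the given annotators; None means 'not seen'."""
    n_images = len(next(iter(annotations.values())))
    result = []
    for i in range(n_images):
        votes = [annotations[a][i] for a in annotators if annotations[a][i] is not None]
        result.append(majority_vote(votes))
    return result

def accuracy(labels, reference):
    """Fraction of an annotator's labeled images that agree with the reference."""
    pairs = [(l, r) for l, r in zip(labels, reference) if l is not None]
    return sum(l == r for l, r in pairs) / len(pairs)

def filter_annotators(annotations):
    """Drop annotators more than one SD below the mean accuracy (assumed rule),
    then rebuild the consensus from the remaining annotators."""
    names = list(annotations)
    vote = consensus(annotations, names)
    acc = {a: accuracy(annotations[a], vote) for a in names}
    mu, sd = mean(acc.values()), pstdev(acc.values())
    kept = [a for a in names if acc[a] >= mu - sd]
    return kept, consensus(annotations, kept)

# Hypothetical ICM grades from five annotators on six images.
annotations = {
    "e1": ["A", "B", "B", "C", "A", "B"],
    "e2": ["A", "B", "B", "C", "A", "B"],
    "e3": ["A", "B", "C", "C", "A", "B"],
    "e4": ["A", "B", "B", "C", "A", "A"],
    "e5": ["C", "A", "B", "A", "B", "B"],
}
kept, ground_truth = filter_annotators(annotations)
```

In this toy data, annotator `e5` agrees with the initial majority vote on only two of six images and is excluded; the ground truth is then rebuilt from the remaining four annotators.
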

Main results and the role of chance: Out of nine annotators, the labeling accuracy of two embryologists diverged from the consensus vote by more than one standard deviation for at least one of the three Gardner criteria. The consensus vote was built from the remaining seven annotators (mean accuracy EXP 0.81, ICM 0.70, TE 0.67). The Swin architecture outperformed the mean expert accuracy for all three criteria (EXP 0.82, ICM 0.76, TE 0.68), while the Deit and the XCeption architectures outperformed the mean expert accuracy in ICM accuracy (Deit 0.72, XCeption 0.73), and performed equal or worse in EXP and TE accuracy (Deit EXP 0.77, ICM 0.73; XCeption EXP 0.77, TE 0.66). Compared to a recent study that applied AI algorithms to time-lapse imaging data, all our models achieve higher ICM accuracy and comparable TE accuracy. To minimize the role of chance in calculating the models' prediction accuracies, the SWA-Gaussian (SWAG) algorithm was used. SWAG is a method for representing and calibrating uncertainty in Bayesian deep learning. It fits a Gaussian distribution to each network weight and uses it as a posterior over all neural network weights to perform Bayesian model averaging.

Limitations, reasons for caution: To reflect a real IVF lab scenario, embryologists of different origins and levels of experience were involved, and no scoring training was offered to the participants. This may have reduced the degree of consensus, although we excluded two annotators diverging from the mean labeling accuracy.

Wider implications of the findings: In the past, AI algorithms proved to reliably differentiate between good and bad prognosis blastocysts but not necessarily between blastocysts of similar quality. Further AI-supported differentiation on the basis of expansion and cell lineages will facilitate the ranking of blastocysts and would bring automated scoring closer to clinical application.

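
The SWAG idea mentioned under the main results (a Gaussian per weight, then Bayesian model averaging) can be sketched in a few lines. This is a simplified illustration and not the study's code: it uses only the diagonal variant, whereas the full SWAG method also includes a low-rank covariance term, and the toy logistic model, the snapshot values, and all function names are hypothetical stand-ins for a real network and its training trajectory.

```python
import math
import random

def swag_moments(snapshots):
    """Per-weight mean and variance over weight snapshots collected during training."""
    n = len(snapshots)
    dim = len(snapshots[0])
    mean = [sum(s[j] for s in snapshots) / n for j in range(dim)]
    sq_mean = [sum(s[j] ** 2 for s in snapshots) / n for j in range(dim)]
    # Clamp at zero to guard against tiny negative values from rounding.
    var = [max(sq_mean[j] - mean[j] ** 2, 0.0) for j in range(dim)]
    return mean, var

def sample_weights(mean, var, rng):
    """Draw one weight vector from the diagonal Gaussian N(mean, diag(var))."""
    return [rng.gauss(m, math.sqrt(v)) for m, v in zip(mean, var)]

def predict_proba(weights, x):
    """Toy stand-in for a network: logistic regression on a feature vector."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def bma_predict(mean, var, x, n_samples=30, seed=0):
    """Bayesian model averaging: average predictions over sampled weight vectors."""
    rng = random.Random(seed)
    probs = [predict_proba(sample_weights(mean, var, rng), x) for _ in range(n_samples)]
    return sum(probs) / n_samples

# Hypothetical weight snapshots of a two-parameter model.
snapshots = [[0.9, -0.4], [1.1, -0.6], [1.0, -0.5], [1.2, -0.3], [0.8, -0.7]]
swag_mean, swag_var = swag_moments(snapshots)
p = bma_predict(swag_mean, swag_var, x=[1.0, 2.0])
```

Averaging predictions over several sampled weight vectors, rather than relying on a single trained network, is what makes the reported accuracy less dependent on one particular training run.
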
Trial registration number: Not applicable.
