Abstract

Study question: Can ERICA’s deep-learning capabilities allow it to learn the specifics of individual clinics and improve its performance through a quality assurance and fine-tuning process?

Summary answer: Quality assurance and fine-tuning allowed ERICA to adapt to the unique specifications of individual clinics, resulting in improved performance at each clinic.

What is known already: Machine learning (ML) solutions to real-life problems have shown that relying on the generalizability of a single model (its applicability to different scenarios) is fundamentally suboptimal because of the risk of underspecification. Underspecification becomes relevant in environments with a myriad of protocols and approaches, such as IVF treatment. It is naïve to assume that the features extracted from embryos to predict treatment success carry the same weight across such heterogeneous protocols. The underspecification problem is especially relevant when deploying an ML-based product, like ERICA, in a clinical setting.

Study design, size, duration: Retrospective analysis of results from the quality assurance (QA) and fine-tuning (adaptation) process performed for a deep learning algorithm named ERICA (Embryo Ranking Intelligent Classification Assistant) at five clinics (1879 embryos) between August and September 2020.

Participants/materials, setting, methods: QA and fine-tuning consist of a transfer-learning approach (starting from the ERICA Core model) with re-training exclusively on embryos from each clinic. Results are assessed by 10-fold cross-validation, which splits the database into 10 parts and iteratively validates on each part after training on the remaining nine. Performance of ERICA is assessed both before and after the fine-tuning process, and results are presented as averages per clinic. All embryos considered for QA and fine-tuning had a known outcome.
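The 10-fold cross-validation described in the methods can be sketched as follows. This is a minimal, library-free illustration and not ERICA's actual pipeline: the function names `k_fold_indices` and `cross_validate`, and the `fit` and `score` callbacks, are hypothetical placeholders for whatever fine-tuning and evaluation routines a clinic-specific workflow would use.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation.

    Hypothetical helper: shuffles sample indices once, then deals them
    into k folds whose sizes differ by at most one.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        # Train on every fold except the one held out for validation.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


def cross_validate(n_samples, fit, score, k=10, seed=0):
    """Average a validation score over k folds.

    `fit(train_idx)` is assumed to return a trained model and
    `score(model, val_idx)` a float; both are placeholders.
    """
    scores = []
    for train_idx, val_idx in k_fold_indices(n_samples, k, seed):
        model = fit(train_idx)
        scores.append(score(model, val_idx))
    return sum(scores) / len(scores)
```

Each embryo appears in exactly one validation fold, so every sample contributes to the reported average without ever being scored by a model that was trained on it.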
Main results and the role of chance: After fine-tuning, ERICA showed an average improvement of 13 percentage points in accuracy (from 50.2% to 63.2%); 36.6 in specificity (from 22.4% to 59%); 11 in Positive Predictive Value (from 51% to 62%); 19.6 in Negative Predictive Value (from 44.6% to 64.2%); and 3.4 in F1 score (from 60% to 63.4%). Sensitivity decreased from 78% to 65.4%. Our results suggest that ERICA’s Core is robust and lends itself to fine-tuning: it learns from individual laboratory specifics and in this way adapts to new clinics. The results show that the Core model tends to classify embryos from new clinics as having a good prognosis, as reflected in its high sensitivity and low specificity; the two were better balanced after the fine-tuning process. Additionally, the probability of finding a good-prognosis embryo under each label behaved as expected, decreasing from Optimal (65.8%) to Poor prognosis (37.4%).

Limitations, reasons for caution: Underspecification is a challenge for Artificial Intelligence (AI) based solutions that pursue a general model. In this study, our approach of QA followed by fine-tuning to overcome underspecification was successful. However, it was applied to only 5 clinics, and the findings remain to be confirmed on a larger scale.

Wider implications of the findings: QA should be considered a standard step before clinical implementation of any AI-based solution. Our results should be interpreted as the theoretical/expected future performance of ERICA for each clinic. Regular performance assessments are encouraged for all models generated by fine-tuning.

Trial registration number: Not applicable
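For readers less familiar with the metrics reported in the main results, the sketch below shows how they follow from a binary confusion matrix. It uses the standard textbook definitions; the function name `classification_metrics` and the counts in the usage line are illustrative assumptions and are not data from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from counts.

    tp/fp/tn/fn are true/false positives/negatives. A ratio with a
    zero denominator is returned as NaN rather than raising.
    """
    def ratio(num, den):
        return num / den if den else float("nan")

    sensitivity = ratio(tp, tp + fn)   # share of good-prognosis cases detected
    specificity = ratio(tn, tn + fp)   # share of poor-prognosis cases detected
    ppv = ratio(tp, tp + fp)           # positive predictive value (precision)
    npv = ratio(tn, tn + fn)           # negative predictive value
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    f1 = ratio(2 * tp, 2 * tp + fp + fn)  # harmonic mean of PPV and sensitivity
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }


# Made-up counts for illustration only (not study data).
example = classification_metrics(tp=40, fp=30, tn=20, fn=10)
```

A model that labels most embryos as good prognosis accumulates false positives, which keeps sensitivity high while pulling specificity down; this is the pattern the abstract reports for the Core model before fine-tuning.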
