Abstract

Study question
Deep-learning algorithms are known to be non-robust: can the variability and inconsistency of AI algorithms used for embryo selection be reduced?

Summary answer
We reduced the variability of the algorithms (measured under input modifications such as rotations and brightness changes) by 86% while preserving their predictive quality.

What is known already
Deep-learning methods are generally known to be non-robust, i.e., their decisions can change with even a slight modification of the input data. Current embryo-scoring solutions are not robust; for example, rotating the input image results in a different score in most solutions on the market. Despite this fact and the concerns expressed by embryologists, there are no other publications focusing on the problem of variance in AI solutions used in IVF. Most publications report accuracy, sensitivity, specificity, and ROC AUC; none report variance metrics.

Study design, size, duration
The dataset was collected in multiple clinics using various devices. It contains 34,821 embryos (4,510 of which were transferred with known pregnancy outcomes), represented by time-lapse videos or images, giving 3,290,481 frames of embryos at various maturity levels. From this dataset, 925 randomly selected embryos were set aside as a test set. The test frames were modified with transformations that are not supposed to change the algorithm's output, and we measured the variability of the scores given by our algorithm.

Participants/materials, setting, methods
We considered seven modifications of images that should not influence embryo scoring:
• Rotations (10 different angles);
• Brightness and contrast modifications;
• Substitution of frames (frames from time-lapse monitoring taken within a 2-hour interval);
• Blur (generalised normal filter);
• Gaussian noise;
• Gaussian blur;
• Sharpening.
We used several techniques to reduce the variance of our deep neural network model (an architecture commonly used for embryo selection):
• Ensembling (of different models trained in cross-validation);
• Test-time augmentation (TTA; see the sketch after this list);
• Robust training.
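As an illustration of test-time augmentation, the following minimal sketch averages a model's score over rotated and brightness-adjusted copies of an embryo frame. It assumes a PyTorch-style model that returns a single scalar score and uses torchvision transforms; the function names, angles, and factors are hypothetical and do not come from the EMBROAID code base.

    import torch
    import torchvision.transforms.functional as TF

    def tta_score(model, image,
                  angles=(0, 36, 72, 108, 144, 180, 216, 252, 288, 324),
                  brightness_factors=(0.9, 1.0, 1.1)):
        """Average the score over label-preserving modifications of one frame,
        so the final prediction is less sensitive to any single modification."""
        model.eval()
        scores = []
        with torch.no_grad():
            for angle in angles:
                rotated = TF.rotate(image, angle)  # image: (C, H, W) tensor
                for factor in brightness_factors:
                    augmented = TF.adjust_brightness(rotated, factor)
                    scores.append(model(augmented.unsqueeze(0)).item())  # add batch dim
        return sum(scores) / len(scores)

The same pool of modifications can also be applied during training (robust training), so that the model itself becomes less sensitive to them.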
Main results and the role of chance
In order to measure the variance we used the following method. First, the scores are mapped to the standard uniform distribution; in other words, each score is replaced by the percentile in which it lies. This normalises the range of the scores so that variances can be compared. Second, we train the EMBROAID model on augmented data that includes all the above modifications. Third, we compute the variance of the normalised scores on the test set. The mean variance across all measured input modifications dropped by 86% (0.0055 to 0.0008). The individual drops in variance per input modification were: rotations, 77% (0.009 to 0.002); brightness and contrast, 81% (0.0036 to 0.0007); substitution of frames, 76% (0.0076 to 0.0019); blur, 94% (0.012 to 0.0008); Gaussian noise, 96% (0.0049 to 0.0002); Gaussian blur, 95% (0.0052 to 0.0003); sharpening, 77% (0.0015 to 0.0003). Significance was tested with the Wilcoxon rank-sum test, giving p < 0.01 for all input modifications. Finally, we stress that these results were obtained without any loss in the ROC AUC metric: we tested the algorithms both on the original and on the modified test set, and both models achieved an ROC AUC of 0.66 (CI 0.63-0.69) on both test sets.

Limitations, reasons for caution
Further work needs to be done to extend the set of possible data augmentations.

Wider implications of the findings
Increased reliability of AI scoring algorithms for embryo selection. It is possible to obtain consistent results over a wide range of data modifications.

Trial registration number
Not applicable
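As an illustration of the evaluation protocol described under "Main results and the role of chance", the sketch below shows how scores could be mapped to percentiles, how the per-embryo variance over the modified copies could be computed, and how two models could be compared with the Wilcoxon rank-sum test. It uses NumPy and SciPy; the variable names and the choice of reference scores for the percentile mapping are assumptions, not the authors' implementation.

    import numpy as np
    from scipy.stats import ranksums

    def to_percentile(scores, reference_scores):
        """Map raw scores to the standard uniform distribution: each score is
        replaced by the fraction of reference scores lying below it."""
        reference = np.sort(np.asarray(reference_scores))
        return np.searchsorted(reference, scores) / len(reference)

    def per_embryo_variance(scores, reference_scores):
        """Variance of percentile-normalised scores per embryo.
        `scores` has one row per embryo and one column per input modification."""
        normalised = to_percentile(np.asarray(scores), reference_scores)
        return normalised.var(axis=1)

    # Hypothetical usage: compare a baseline model with a robustly trained one.
    # Both score arrays have shape (n_embryos, n_modifications).
    # var_baseline = per_embryo_variance(baseline_scores, reference_scores)
    # var_robust   = per_embryo_variance(robust_scores, reference_scores)
    # stat, p_value = ranksums(var_baseline, var_robust)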