Abstract

Study question
Is an AI model as good as an experienced embryologist?

Summary answer
Properly trained AI models can perform as well as embryologists with respect to accuracy, while at the same time improving decisiveness.

What is known already
There are many attempts to solve the embryo selection problem. One approach is to determine specific features of the embryo and calculate a final score on that basis. The non-algorithmic approach relies on the professional knowledge of the embryologist, who performs a visual analysis and scores the embryos. The most promising attempts use deep learning, specifically CNNs, to predict pregnancy probabilities directly from a given image or set of images. Although these tools deliver high-quality answers, they show rather low intra-embryologist agreement.

Study design, size, duration
Comparing the scoring of AI algorithms with that of embryologists is a challenge, as they lack a common scale, e.g., a total ranking of embryos. To overcome this problem, we designed a test containing 150 pairs of day-5 embryo time-lapses. In each pair, only one embryo resulted in pregnancy (implantation based on beta-hCG). We compared our algorithm with the decisions of 10 embryologists with 10 years of experience on average.

Participants/materials, setting, methods
We created a web questionnaire for the test. It displayed the time-lapses for a pair of embryos and let the embryologists choose the more promising one. We invited doctors from several clinics to take part in the study. The AI model was tested on the same data, i.e., its goal was to choose between the two transferred embryos. After data collection, the effectiveness of the embryologists and of the model was compared.

Main results and the role of chance
The accuracy of predicting the embryo that resulted in pregnancy was:
- 66.9% (CI 63.1-70.7) for our model,
- 63.8% (CI 62.6-65.0) on average for the embryologists.
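The accuracies above are proportions over 150 forced-choice comparisons. The abstract does not state how the confidence intervals were computed; a minimal sketch, assuming a binomial Wilson score interval and using purely illustrative counts (the true per-pair outcomes are not given), would be:

```python
import math

def wilson_ci(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return centre - half, centre + half

# Hypothetical counts: suppose the model picked the implanted embryo
# in 100 of the 150 pairs (an accuracy near the reported 66.9%).
correct, total = 100, 150
lo, hi = wilson_ci(correct, total)
print(f"accuracy = {correct / total:.1%}, 95% CI ({lo:.1%}, {hi:.1%})")
```

With only 150 pairs, a plain binomial interval is wider than the one reported for the model, so the authors may have pooled repeated measurements or used a different estimator; the sketch only shows the shape of the computation.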
The decisions taken by the algorithm are slightly better; however, this holds with rather low statistical significance. Some decisions taken by the doctors have high variance, e.g., there were cases where only 5 out of 10 decisions indicated one embryo. To understand these variances better, we divided the test set into two parts: (a) 57 cases where all doctors agreed on the decision, and (b) 93 cases where there were some differences. On set (a), the decisions of the algorithm agreed with the experts in 95% of cases. For set (b), the correlation of both the expert decisions and the algorithm's with the ground truth was rather weak, i.e., approximately 0.1. The last aspect of this study was decision time: the average time for all experts was 54 seconds per decision, while our algorithm made decisions in 2 seconds on average.

Limitations, reasons for caution
The experiment shows high agreement between the algorithm and the experts in cases where the experts agree. However, the difference between the average accuracy scores has low statistical significance.

Wider implications of the findings
The model returns the result of its analysis almost immediately, so it can speed up the process of selecting the most promising embryos. The model agrees with the experts in cases where the experts agree.

Trial registration number
not applicable
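The split into unanimous and contested cases described in the results can be reproduced mechanically from the raw votes. A minimal sketch, assuming each case records the 10 experts' binary choices alongside the model's choice (the data structure and values here are hypothetical, not the study's actual records):

```python
# Each case: 10 expert votes (1 = embryo A chosen, 0 = embryo B),
# the model's choice, and the ground truth. All values are illustrative.
cases = [
    {"votes": [1] * 10, "model": 1, "truth": 1},        # unanimous case
    {"votes": [1] * 5 + [0] * 5, "model": 0, "truth": 1},  # maximally split case
]

unanimous = [c for c in cases if len(set(c["votes"])) == 1]
split = [c for c in cases if len(set(c["votes"])) > 1]

# On unanimous cases, count how often the model sides with the experts
# (this is the 95%-agreement figure reported for set (a)).
agree = sum(c["model"] == c["votes"][0] for c in unanimous)
print(f"{len(unanimous)} unanimous, {len(split)} split; "
      f"model agrees with unanimous experts in {agree}/{len(unanimous)}")
```

The same partition supports the weak-correlation check on set (b): with the votes and ground truth in hand, any correlation measure can be applied per subset.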