Abstract

Study question
Can a generalizable computer vision foundation model be trained using self-supervised learning on time-lapse embryo images to perform multiple clinically relevant downstream tasks?

Summary answer
We developed FEMI, a foundation model trained on 8 million time-lapse images, to perform multiple clinical tasks, including blastocyst quality scoring, ploidy prediction, and segmentation.

What is known already
In vitro fertilization success depends critically on selecting viable embryos, a process hampered by the limitations of current diagnostic tools, high costs, and ethical concerns. In recent years, various medical fields have increasingly explored foundation models built on vision transformer architectures and trained in a self-supervised manner on vast unlabeled image datasets. Once trained, the encoder portion of such a foundation model can be extracted and fine-tuned with a labeled dataset to perform clinically relevant tasks such as classification, regression, and segmentation.

Study design, size, duration
The foundation model was trained on 8 million Embryoscope® (E-SD) and Embryoscope+® (E+) images spanning clinics in the United States, Canada, and Europe. The datasets contained additional information, such as ploidy status, blastocyst scores (numerical values from 3 to 14, based on Zhan et al. 2020), and segmentation masks (for trophectoderm [TE], inner cell mass [ICM], and zona pellucida [ZP]), that was used for training the downstream tasks.

Participants/materials, setting, methods
A masked autoencoder was trained on time-lapse images to create the foundation model (FEMI). The encoder was then extracted from the autoencoder and fine-tuned on three image-based downstream tasks: ploidy prediction, blastocyst scoring, and embryo component segmentation (a minimal sketch of this pipeline follows the results below). The performance of the fine-tuned foundation model was compared with that of VGG16 architectures trained specifically for each downstream task. We evaluated performance using the area under the receiver operating characteristic curve (AUROC), mean absolute error (MAE), and mean intersection over union (mIoU).

Main results and the role of chance
The training dataset for the downstream tasks consisted of image data and labels, including PGT-A ploidy results (euploid or aneuploid), blastocyst scores (3–14), and segmentation masks. No clinical data (e.g., maternal age) were used, in order to assess what FEMI can learn from time-lapse images alone relative to the VGG16 baselines. FEMI and the baseline models used the same data splits for consistent comparison. For ploidy prediction, models trained on a Weill Cornell Medicine (WCM) dataset were validated on WCM, Florida, and Spain data; FEMI achieved an AUROC of 0.610 ± 0.004, outperforming the baseline’s 0.590 ± 0.005. In blastocyst score prediction, FEMI attained a lower MAE of 0.099 ± 0.002 versus the baseline’s 0.118 ± 0.001. For embryo segmentation (TE, ICM, ZP), a publicly available dataset from Simon Fraser University was split for training and validation; FEMI exceeded the baseline in mIoU for TE segmentation (0.738 ± 0.002 vs. 0.726 ± 0.002) and was comparable for ICM (0.832 ± 0.008 vs. 0.821 ± 0.011) and ZP segmentation (0.779 ± 0.002 vs. 0.778 ± 0.010).
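To illustrate the two-stage recipe described in the methods, the following is a minimal, hypothetical PyTorch sketch of masked-autoencoder pretraining followed by encoder extraction for a downstream ploidy classifier. All layer sizes, the 16-pixel patches, the 75% mask ratio, single-channel input, and the names TinyMAE and PloidyNet are illustrative assumptions, not FEMI's published configuration.

import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Simplified masked autoencoder: encode visible patches, reconstruct masked ones."""
    def __init__(self, img_size=224, patch=16, dim=256, mask_ratio=0.75):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2
        self.patch_dim = patch * patch            # single-channel frames assumed
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(self.patch_dim, dim)
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.recon_head = nn.Linear(dim, self.patch_dim)   # pixel reconstruction

    def patchify(self, x):                        # (B,1,H,W) -> (B,N,patch*patch)
        p = int(self.patch_dim ** 0.5)
        return x.unfold(2, p, p).unfold(3, p, p).reshape(x.size(0), -1, self.patch_dim)

    def forward(self, x):
        patches = self.patchify(x)
        tokens = self.embed(patches) + self.pos
        B, N, D = tokens.shape
        keep = int(N * (1 - self.mask_ratio))
        perm = torch.rand(B, N, device=x.device).argsort(dim=1)
        keep_idx, mask_idx = perm[:, :keep], perm[:, keep:]
        visible = tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
        encoded = self.encoder(visible)           # encoder sees visible patches only
        # Re-insert encoded tokens among learned mask tokens, then decode.
        full = self.mask_token.expand(B, N, D).scatter(
            1, keep_idx.unsqueeze(-1).expand(-1, -1, D), encoded)
        recon = self.recon_head(self.decoder(full + self.pos))
        idx = mask_idx.unsqueeze(-1).expand(-1, -1, self.patch_dim)
        # Reconstruction loss is computed on the masked patches only.
        return nn.functional.mse_loss(recon.gather(1, idx), patches.gather(1, idx))

class PloidyNet(nn.Module):
    """Downstream head: reuse the pretrained encoder, mean-pool, classify ploidy."""
    def __init__(self, mae: TinyMAE):
        super().__init__()
        self.patchify = mae.patchify              # reuse the pretrained pieces as-is
        self.embed, self.pos, self.encoder = mae.embed, mae.pos, mae.encoder
        self.classifier = nn.Linear(mae.pos.size(-1), 1)   # euploid-vs-aneuploid logit

    def forward(self, x):
        tokens = self.embed(self.patchify(x)) + self.pos   # no masking at fine-tune time
        return self.classifier(self.encoder(tokens).mean(dim=1))

if __name__ == "__main__":
    mae = TinyMAE()
    loss = mae(torch.randn(2, 1, 224, 224))       # one self-supervised pretraining step
    loss.backward()
    logits = PloidyNet(mae)(torch.randn(2, 1, 224, 224))   # downstream forward pass
    print(loss.item(), logits.shape)

In the system described above, the same extracted encoder is fine-tuned separately for blastocyst-score regression and TE/ICM/ZP segmentation; only the task head changes.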
Limitations, reasons for caution
The current version of FEMI was trained on a subset of our available datasets and can be improved through further training. Moreover, the downstream ploidy and blastocyst score models may be biased by errors in PGT results and by the subjectivity of embryologists’ scoring, respectively.

Wider implications of the findings
A generalizable foundation model for IVF time-lapse imaging can aid clinicians in embryo selection by providing multiple clinical insights for better decision-making. Once fully trained, FEMI will be made publicly accessible to the scientific community, allowing researchers to fine-tune it for their own clinic-specific applications with their own image datasets (a hypothetical fine-tuning sketch follows below).

Trial registration number
Not applicable.
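To make the intended reuse concrete, below is a hypothetical fine-tuning loop building on the sketch above. The checkpoint name femi_encoder.pt, the clinic_loader data, and all hyperparameters are placeholders, not a published API; released weights and formats may differ.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for a clinic's own labeled images (here: random tensors and labels).
clinic_data = TensorDataset(torch.randn(8, 1, 224, 224), torch.randint(0, 2, (8,)))
clinic_loader = DataLoader(clinic_data, batch_size=4)

mae = TinyMAE()
# mae.load_state_dict(torch.load("femi_encoder.pt"))  # placeholder path for released weights
model = PloidyNet(mae)                                 # swap in any task-specific head
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)   # small LR, typical for fine-tuning

for images, labels in clinic_loader:
    logits = model(images).squeeze(1)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels.float())
    opt.zero_grad()
    loss.backward()
    opt.step()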