Integration of spatiotemporal features into machine learning assessment of open surgical skills.
- Research Article
5
- 10.3390/info14010053
- Jan 16, 2023
- Information
Machine learning (ML) techniques discover knowledge from large amounts of data. Modeling in ML is becoming essential to software systems in practice. ML research communities have focused on the accuracy and efficiency of ML models, while validating the quality of those models has received less attention. Validating ML applications is a challenging and time-consuming process for developers since prediction accuracy heavily relies on generated models. ML applications are written in a relatively data-driven programming style on top of black-box ML frameworks. All of the datasets and the ML application need to be individually investigated. Thus, the ML validation tasks take a lot of time and effort. To address this limitation, we present a novel quality validation technique, called MLVal, that increases the reliability of ML models and applications. Our approach helps developers inspect the training data and the generated features for the ML model. A data validation technique is important and beneficial to software quality since the quality of the input data affects the speed and accuracy of training and inference. Inspired by software debugging/validation for reproducing the potential reported bugs, MLVal takes as input an ML application and its training datasets to build the ML models, helping ML application developers easily reproduce and understand anomalies in the ML application. We have implemented an Eclipse plugin for MLVal that allows developers to validate the prediction behavior of their ML applications, the ML model, and the training data on the Eclipse IDE. In our evaluation, we used 23,500 documents in the bioengineering research domain. We assessed the ability of the MLVal validation technique to effectively help ML application developers: (1) investigate the connection between the produced features and the labels in the training model, and (2) detect errors early to secure the quality of models from better data. 
Our approach reduces the cost of engineering efforts to validate problems, improving data-centric workflows of the ML application development.
- Preprint Article
- 10.5194/egusphere-egu23-11636
- May 15, 2023
In recent years, machine learning (ML) models have proven useful in solving problems in a wide variety of fields such as medicine, economics, manufacturing, transportation, energy, and education. With increased interest in ML models and advances in sensor technologies, ML models are being widely applied even in the civil engineering domain. ML models enable the analysis of large amounts of data, automation, and improved decision making, and they provide more accurate predictions. While several state-of-the-art reviews have been conducted in each sub-domain of civil engineering (e.g., geotechnical engineering, structural engineering) or on specific application problems (e.g., structural damage detection, water quality evaluation), little effort has been devoted to a comprehensive review of ML models applied in civil engineering that compares them across sub-domains. A systematic but domain-specific literature review framework should be employed to effectively classify and compare the models. To that end, this study proposes a novel review approach based on the hierarchical classification tree “D-A-M-I-E (Domain-Application problem-ML models-Input data-Example case)”. The “D-A-M-I-E” classification tree classifies the ML studies in civil engineering based on (1) the domain of civil engineering, (2) the application problem, (3) the applied ML models and (4) the data used in the problem. Moreover, the data used for the ML models in each application example are examined based on the specific characteristics of the domain and the application problem. For a comprehensive review, five different domains (structural engineering, geotechnical engineering, water engineering, transportation engineering and energy engineering) are considered, and the ML application problem is divided into five different problems (prediction, classification, detection, generation, optimization). Based on the “D-A-M-I-E” classification tree, about 300 ML studies in civil engineering are reviewed. 
For each domain, analysis and comparison were conducted on the following questions: (1) which problems are mainly solved with ML models, (2) which ML models are mainly applied in each domain and problem, (3) how advanced the ML models are and (4) what kind of data are used and what data processing is performed to apply the ML models. This paper also assessed the expansion and applicability of the proposed methodology to other areas (e.g., Earth system modeling, climate science). Furthermore, based on the research gaps identified for ML models in each domain, this paper provides future directions for ML in civil engineering in terms of data handling approaches (e.g., collection, handling, storage, and transmission) and hopes to help the application of ML models in other fields.
- Research Article
19
- 10.1371/journal.pone.0282608
- Mar 9, 2023
- PLOS ONE
COVID-19 is highly infectious and causes acute respiratory disease. Machine learning (ML) and deep learning (DL) models are vital in detecting disease from computerized chest tomography (CT) scans. The DL models outperformed the ML models. For COVID-19 detection from CT scan images, DL models are used as end-to-end models. Thus, the performance of the model is evaluated for the quality of the extracted features and classification accuracy. There are four contributions included in this work. First, this research is motivated by studying the quality of the features extracted by the DL model by feeding these extracted features to an ML model. In other words, we proposed comparing the end-to-end DL model performance against the approach of using DL for feature extraction and ML for the classification of COVID-19 CT scan images. Second, we proposed studying the effect of fusing extracted features from image descriptors, e.g., Scale-Invariant Feature Transform (SIFT), with extracted features from DL models. Third, we proposed a new Convolutional Neural Network (CNN) to be trained from scratch and then compared to deep transfer learning on the same classification problem. Finally, we studied the performance gap between classic ML models and ensemble learning models. The proposed framework is evaluated using a CT dataset, where the obtained results are evaluated using five different metrics. The obtained results revealed that using the proposed CNN model is better than using the well-known DL model for the purpose of feature extraction. Moreover, using a DL model for feature extraction and an ML model for the classification task achieved better results in comparison to using an end-to-end DL model for detecting COVID-19 CT scan images. Of note, the accuracy rate of the former method improved by using ensemble learning models instead of the classic ML models. The proposed method achieved the best accuracy rate of 99.39%.
- Research Article
16
- 10.1007/s10143-023-02028-x
- May 16, 2023
- Neurosurgical review
Machine learning (ML) models are being actively used in modern medicine, including neurosurgery. This study aimed to summarize the current applications of ML in the analysis and assessment of neurosurgical skills. We conducted this systematic review in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. We searched the PubMed and Google Scholar databases for eligible studies published until November 15, 2022, and used the Medical Education Research Study Quality Instrument (MERSQI) to assess the quality of the included articles. Of the 261 studies identified, we included 17 in the final analysis. Studies were most commonly related to oncological, spinal, and vascular neurosurgery using microsurgical and endoscopic techniques. Machine learning-evaluated tasks included subpial brain tumor resection, anterior cervical discectomy and fusion, hemostasis of the lacerated internal carotid artery, brain vessel dissection and suturing, glove microsuturing, lumbar hemilaminectomy, and bone drilling. The data sources included files extracted from VR simulators and microscopic and endoscopic videos. The ML application was aimed at classifying participants into several expertise levels, analysis of differences between experts and novices, surgical instrument recognition, division of operation into phases, and prediction of blood loss. In two articles, ML models were compared with those of human experts. The machines outperformed humans in all tasks. The most popular algorithms used to classify surgeons by skill level were the support vector machine and k-nearest neighbors, and their accuracy exceeded 90%. The "you only look once" detector and RetinaNet usually solved the problem of detecting surgical instruments - their accuracy was approximately 70%. The experts differed by more confident contact with tissues, higher bimanuality, smaller distance between the instrument tips, and relaxed and focused state of the mind. 
The average MERSQI score was 13.9 (from 18). There is growing interest in the use of ML in neurosurgical training. Most studies have focused on the evaluation of microsurgical skills in oncological neurosurgery and on the use of virtual simulators; however, other subspecialties, skills, and simulators are being investigated. Machine learning models effectively solve different neurosurgical tasks related to skill classification, object detection, and outcome prediction. Properly trained ML models outperform human efficacy. Further research on ML application in neurosurgery is needed.
- Research Article
10
- 10.2196/28749
- Jan 18, 2022
- Journal of Medical Internet Research
Background: Crowdsourcing services, such as Amazon Mechanical Turk (AMT), allow researchers to use the collective intelligence of a wide range of web users for labor-intensive tasks. As the manual verification of the quality of the collected results is difficult because of the large volume of data and the quick turnaround time of the process, many questions remain to be explored regarding the reliability of these resources for developing digital public health systems. Objective: This study aims to explore and evaluate the application of crowdsourcing, generally, and AMT, specifically, for developing digital public health surveillance systems. Methods: We collected 296,166 crowd-generated labels for 98,722 tweets, labeled by 610 AMT workers, to develop machine learning (ML) models for detecting behaviors related to physical activity, sedentary behavior, and sleep quality among Twitter users. To infer the ground truth labels and explore the quality of these labels, we studied 4 statistical consensus methods that are agnostic of task features and only focus on worker labeling behavior. Moreover, to model the meta-information associated with each labeling task and leverage the potential of context-sensitive data in the truth inference process, we developed 7 ML models, including traditional classifiers (offline and active), a deep learning–based classification model, and a hybrid convolutional neural network model. Results: Although most crowdsourcing-based studies in public health have often equated majority vote with quality, the results of our study using a truth set of 9000 manually labeled tweets showed that consensus-based inference models mask underlying uncertainty in data and overlook the importance of task meta-information. 
Our evaluations across 3 physical activity, sedentary behavior, and sleep quality data sets showed that truth inference is a context-sensitive process, and none of the methods studied in this paper were consistently superior to others in predicting the truth label. We also found that the performance of the ML models trained on crowd-labeled data was sensitive to the quality of these labels, and poor-quality labels led to incorrect assessment of these models. Finally, we have provided a set of practical recommendations to improve the quality and reliability of crowdsourced data. Conclusions: Our findings indicate the importance of the quality of crowd-generated labels in developing ML models designed for decision-making purposes, such as public health surveillance decisions. A combination of inference models outlined and analyzed in this study could be used to quantitatively measure and improve the quality of crowd-generated labels for training ML models.
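As a rough illustration of the majority-vote consensus that the abstract above contrasts with richer inference models, the following minimal Python sketch aggregates crowd labels per item and reports the level of agreement. The tweet IDs, label names, and the `majority_vote` helper are hypothetical and are not taken from the study; three labels per item is an assumption suggested by the reported counts (296,166 labels for 98,722 tweets).

```python
from collections import Counter

def majority_vote(labels):
    """Return (winning_label, agreement) for one item's crowd labels.

    agreement is the fraction of workers who chose the winning label.
    Ties are broken by first occurrence, which is an arbitrary choice.
    """
    counts = Counter(labels)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(labels)

# Hypothetical labels from three workers per tweet (illustrative data only).
crowd = {
    "tweet_1": ["active", "active", "active"],
    "tweet_2": ["active", "sedentary", "active"],
}

consensus = {tweet_id: majority_vote(lbls) for tweet_id, lbls in crowd.items()}
```

Both tweets receive the same consensus label here even though agreement differs, which is one way a plain majority vote can hide uncertainty in the underlying labels.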
- Conference Article
2
- 10.1109/compsac54236.2022.00166
- Jun 1, 2022
Modeling in machine learning (ML) is becoming an essential part of software systems in practice. Validating ML applications is a challenging and time-consuming process for developers since the accuracy of prediction heavily relies on generated models. ML applications are written in a relatively data-driven programming style on top of black-box ML frameworks. If all of the datasets and the ML application need to be individually investigated, the ML debugging tasks would take a lot of time and effort. To address this limitation, we present a novel debugging technique for machine learning applications, called MLDBUG, that helps ML application developers inspect the training data and the generated features for the ML model. Inspired by software debugging for reproducing the potential reported bugs, MLDBUG takes as input an ML application and its training datasets to build the ML models, helping ML application developers easily reproduce and understand anomalies in the ML application. We have implemented an Eclipse plugin for MLDBUG which allows developers to validate the prediction behavior of their ML applications, the ML model, and the training data on the Eclipse IDE. In our evaluation, we used 23,500 documents in the bioengineering research domain. We assessed how effectively MLDBUG can help ML application developers investigate the connection between the produced features and the labels in the training model and the relationship between the training instances and the instances the model predicts.
- Dissertation
1
- 10.32657/10356/171848
- Jan 1, 2023
The thesis evaluates the machine learning (ML) and deep learning (DL) approaches’ performance in accurately detecting glaucoma based on optical coherence tomography tabular data and images from individuals of different ethnicities. While numerous studies have employed ML and DL techniques for glaucoma identification, their performance has not been evaluated across diverse ethnic groups. In addition, a DL approach utilizing the Swin Transformer architecture trained on the thickness map images of the retinal nerve fiber layer (RNFL) was also evaluated. This Swin transformer DL model demonstrated an AUC of 0.97 in the internal testing dataset (Asian) and 0.88 in the external testing dataset (Caucasian). However, like the ML classifiers trained on measured data, the DL approach which was trained on raw thickness map images also exhibited poor reproducibility across different datasets. To address these issues, a cross-sectional study design was employed to investigate both ML and DL’s model performance in glaucoma detection using OCT data from individuals of different ethnicities. The study included 514 Asian participants, consisting of 257 with glaucoma and 257 controls, to develop ML and DL classifiers. The trained classifiers were subsequently evaluated on two separate participant groups comprising 356 Asians and 138 Caucasians. Two machine learning classifiers were created using the two types of RNFL thickness, one using the original values extracted from OCT machines (measured RNFL), and the other generated from the compensation model. The compensation model is a multivariate regression trained on normal individuals. It corrects the 12-clock RNFL thicknesses for multiple demographic and anatomical parameters. Additionally, a deep learning model was developed using the Swin Transformer architecture based on the measured RNFL thickness map images from OCT. Explainable artificial intelligence techniques (CAM and SHAP) were utilized to better interpret the results. 
Performance metrics such as the area under the receiver operating characteristic curve (AUC), accuracy and sensitivity were employed to examine the effectiveness of different glaucoma detection models. Both machine learning (AUC = 0.96) and deep learning (AUC = 0.97) models demonstrated superior performance compared to the raw measured data (baseline, AUC = 0.93), in the internal testing dataset (Asian). However, in the external testing dataset (Caucasian), ML models utilizing the compensated data (AUC = 0.93) exhibited significantly better performance compared to ML models using the original measured data (AUC = 0.83) and the baseline (AUC = 0.82). Furthermore, the ML and DL models trained on measured data exhibited inadequate generalization ability across different ethnicities, whereas the ML model using the compensated data maintained its performance in the external testing dataset. These findings caution against the indiscriminate application of ML and DL models to patient cohorts of different ethnicities. They also suggest that incorporating the compensation model into the development of ML models may enhance their performance in glaucoma detection across diverse ethnicities. Overall, the study highlights the importance of accounting for anatomical variations across different ethnic groups when developing machine-learning models for glaucoma detection using OCT data.
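For readers unfamiliar with the AUC values quoted above, the sketch below computes AUC in its pairwise form: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. This is a generic illustration, not the thesis's code, and the scores are made-up numbers.

```python
def auc(scores_pos, scores_neg):
    """Pairwise AUC: P(score of positive > score of negative), ties count 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical classifier scores (illustrative only).
glaucoma_scores = [0.9, 0.8, 0.4]   # cases
control_scores = [0.7, 0.3, 0.2]    # controls

example_auc = auc(glaucoma_scores, control_scores)  # 8 of 9 pairs ranked correctly
```

The quadratic loop is fine for small samples; rank-based formulas give the same value more efficiently on large datasets.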
- Research Article
1
- 10.62487/yyx99243
- Jan 27, 2024
- Web3 Journal: ML in Health Science
Aim: The majority of machine learning (ML) models in healthcare are built on retrospective data, much of which is collected without explicit patient consent for use in artificial intelligence (AI) and ML applications. The primary aim of this study was to evaluate whether clinicians and scientific researchers themselves consent to provide their own data for the training of ML models. Materials and Methods: This survey was conducted through an anonymous online survey, utilizing platforms such as Telegram, LinkedIn, and Viber. The target audience comprised specific international groups, primarily Russian, German, and English-speaking, of clinicians and scientific researchers. These participants ranged in their levels of expertise and experience, from beginners to veterans. The survey centered on a singular, pivotal question: “Do You Consent to the Use of Your Biological and Private Data for Training Machine Learning and AI Models?” Respondents had the option to choose from three responses: “Yes” and “No”. Results: The survey was conducted in January 2024. A total of 119 unique and verified individuals participated in the survey. The results revealed that only 50% of respondents (63 persons) expressed consent to provide their own data for the training of ML and AI models. Conclusion: In the development of ML and AI models, particularly open-source ones, it is crucial to ascertain whether participants are willing to provide their private data. While ML algorithms can transform the nature of data, it is important to remember that the primary owner of this data is the individual. Our findings show that in 50% of the cases, even participants from scientific research and clinical backgrounds – individuals typically accountable for ensuring data quality in AI and ML model development – do not consent to the use of their data in AI and ML settings. 
This highlights the need for more stringent consent processes and ethical considerations in the utilization of personal data in AI and ML research.
- Research Article
9
- 10.14778/3352063.3352110
- Aug 1, 2019
- Proceedings of the VLDB Endowment
Developing machine learning (ML) applications is similar to developing traditional software --- it is often an iterative process in which developers navigate within a rich space of requirements, design decisions, implementations, empirical quality, and performance. In traditional software development, software engineering is the field of study which provides principled guidelines for this iterative process. However, as of today, the counterpart of "software engineering for ML" is largely missing --- developers of ML applications are left with powerful tools (e.g., TensorFlow and PyTorch) but little guidance regarding the development lifecycle itself. In this paper, we view the management of ML development lifecycles from a data management perspective. We demonstrate two closely related systems, ease.ml/ci and ease.ml/meter, that provide some "principled guidelines" for ML application development: ci is a continuous integration engine for ML models and meter is a "profiler" for controlling overfitting of ML models. Both systems focus on managing the "statistical generalization power" of datasets used for assessing the quality of ML applications, namely, the validation set and the test set. By demonstrating these two systems we hope to spawn further discussions within our community on building this new type of data management systems for statistical generalization.
- Conference Article
3
- 10.1109/icidca56705.2023.10100252
- Mar 14, 2023
Machine learning in medical applications is currently one of the focus areas for researchers. Machine learning, as an application of artificial intelligence, is not only providing solutions to complex problems but has also revolutionised the medical field. The main aim of machine learning is to improve its learning process over time by taking in all the relevant data and information in the form of different inputs and observations. This study reviews different medical disease prediction and detection techniques based on distinct deep learning and machine learning models. Problems related to medical diseases, such as cancer-related diseases and heart, lung, thyroid and kidney diseases, are discussed in this article. Detecting and analysing medical diseases is one of the prominent applications of machine and deep learning. Deep learning as a technology offers a large set of different and innovative tools that are relevant to the issues faced in the field of medical image processing. This study discusses the applications of machine learning and then some of the advancements made for different diseases such as breast cancer, heart disease, skin disease, and kidney disease.
- Research Article
29
- 10.1016/j.jhazmat.2023.133196
- Dec 8, 2023
- Journal of Hazardous Materials
Machine learning-based water quality prediction using octennial in-situ Daphnia magna biological early warning system data
- Research Article
45
- 10.1097/corr.0000000000001360
- Jul 30, 2020
- Clinical Orthopaedics & Related Research
Machine learning (ML) is a subdomain of artificial intelligence that enables computers to abstract patterns from data without explicit programming. A myriad of impactful ML applications already exists in orthopaedics ranging from predicting infections after surgery to diagnostic imaging. However, no systematic reviews that we know of have compared, in particular, the performance of ML models with that of clinicians in musculoskeletal imaging to provide an up-to-date summary regarding the extent of applying ML to imaging diagnoses. By doing so, this review delves into where current ML developments stand in aiding orthopaedists in assessing musculoskeletal images. This systematic review aimed (1) to compare performance of ML models versus clinicians in detecting, differentiating, or classifying orthopaedic abnormalities on imaging by (A) accuracy, sensitivity, and specificity, (B) input features (for example, plain radiographs, MRI scans, ultrasound), (C) clinician specialties, and (2) to compare the performance of clinician-aided versus unaided ML models. A systematic review was performed in PubMed, Embase, and the Cochrane Library for studies published up to October 1, 2019, using synonyms for machine learning and all potential orthopaedic specialties. We included all studies that compared ML models head-to-head against clinicians in the binary detection of abnormalities in musculoskeletal images. After screening 6531 studies, we ultimately included 12 studies. We conducted quality assessment using the Methodological Index for Non-randomized Studies (MINORS) checklist. All 12 studies were of comparable quality, and they all clearly included six of the eight critical appraisal items (study aim, input feature, ground truth, ML versus human comparison, performance metric, and ML model description). 
This justified summarizing the findings in a quantitative form by calculating the median absolute improvement of the ML models compared with clinicians for the following metrics of performance: accuracy, sensitivity, and specificity. ML models provided, in aggregate, only very slight improvements in diagnostic accuracy and sensitivity compared with clinicians working alone and were on par in specificity (3% (interquartile range [IQR] -2.0% to 7.5%), 0.06% (IQR -0.03 to 0.14), and 0.00 (IQR -0.048 to 0.048), respectively). Inputs used by the ML models were plain radiographs (n = 8), MRI scans (n = 3), and ultrasound examinations (n = 1). Overall, ML models outperformed clinicians more when interpreting plain radiographs than when interpreting MRIs (17 of 34 and 3 of 16 performance comparisons, respectively). Orthopaedists and radiologists performed similarly to ML models, while ML models mostly outperformed other clinicians (outperformance in 7 of 19, 7 of 23, and 6 of 10 performance comparisons, respectively). Two studies evaluated the performance of clinicians aided and unaided by ML models; both demonstrated considerable improvements in ML-aided clinician performance by reporting a 47% decrease of misinterpretation rate (95% confidence interval [CI] 37 to 54; p < 0.001) and a mean increase in specificity of 0.048 (95% CI 0.029 to 0.068; p < 0.001) in detecting abnormalities on musculoskeletal images. At present, ML models have comparable performance to clinicians in assessing musculoskeletal images. ML models may enhance the performance of clinicians as a technical supplement rather than as a replacement for clinical intelligence. Future ML-related studies should emphasize how ML models can complement clinicians, instead of determining the overall superiority of one versus the other. 
This can be accomplished by improving transparent reporting, diminishing bias, determining the feasibility of implantation in the clinical setting, and appropriately tempering conclusions. Level III, diagnostic study.
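The summary statistic used in the review above, a median absolute improvement with its interquartile range (IQR), can be computed as in this minimal sketch. The improvement values are invented for illustration and are not the review's data; the `inclusive` quantile method is one of several conventions and is an assumption here.

```python
import statistics

# Hypothetical per-comparison accuracy improvements of ML over clinicians,
# in percentage points (illustrative numbers only).
improvements = [-2.0, 0.0, 3.0, 5.0, 8.0]

median_improvement = statistics.median(improvements)

# quantiles(..., n=4) returns the three quartile cut points [Q1, Q2, Q3].
q1, _, q3 = statistics.quantiles(improvements, n=4, method="inclusive")
iqr = (q1, q3)
```

Reporting the median with its IQR, rather than a mean, keeps a few extreme comparisons from dominating the summary.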
- Research Article
17
- 10.1016/j.arthro.2022.06.032
- Jul 9, 2022
- Arthroscopy: The Journal of Arthroscopic & Related Surgery
Machine Learning Can Accurately Predict Overnight Stay, Readmission, and 30-Day Complications Following Anterior Cruciate Ligament Reconstruction
- Research Article
55
- 10.1038/s41598-023-31340-1
- Mar 15, 2023
- Scientific Reports
Dermatological conditions are a relevant health problem. Machine learning (ML) models are increasingly being applied to dermatology as a diagnostic decision support tool using image analysis, especially for skin cancer detection and disease classification. The objective of this study was to perform a prospective validation of an image analysis ML model, which is capable of screening 44 skin diseases, comparing its diagnostic accuracy with that of General Practitioners (GPs) and teledermatology (TD) dermatologists in a real-life setting. This was a prospective, diagnostic accuracy study including 100 consecutive patients with a skin problem who visited a participating GP in central Catalonia, Spain, between June 2021 and October 2021. The skin issue was first assessed by the GPs. Then an anonymised skin disease picture was taken and uploaded to the ML application, which returned a list with the Top-5 possible diagnoses in order of probability. The same image was then sent to a dermatologist via TD for diagnosis, as per clinical practice. The GPs' Top-3, ML model’s Top-5 and dermatologist’s Top-3 assessments were compared to calculate the accuracy, sensitivity, specificity and diagnostic accuracy of the ML models. The overall Top-1 accuracy of the ML model (39%) was lower than that of GPs (64%) and dermatologists (72%). When the analysis was limited to the diagnoses on which the algorithm had been explicitly trained (n = 82), the balanced Top-1 accuracy of the ML model increased (48%) and in the Top-3 (75%) was comparable to the GPs' Top-3 accuracy (76%). The Top-5 accuracy of the ML model (89%) was comparable to the dermatologist Top-3 accuracy (90%). For the different diseases, the sensitivity of the model (Top-3 87% and Top-5 96%) is higher than that of the clinicians (Top-3 GPs 76% and Top-3 dermatologists 84%) only in the benign tumour pathology group, which is, on the other hand, the most prevalent category (n = 53). 
Regarding the satisfaction of professionals, 92% of the GPs considered it a useful diagnostic support tool (DST) for the differential diagnosis, and in 60% of the cases an aid in the final diagnosis of the skin lesion. The overall diagnostic accuracy of the model in this study, under real-life conditions, is lower than that of both GPs and dermatologists. This result aligns with the findings of the few existing prospective studies conducted under real-life conditions. The outcomes emphasize the significance of involving clinicians in the training of the model and the capability of ML models to assist GPs, particularly in differential diagnosis. Nevertheless, external testing in real-life conditions is crucial for data validation and regulation of these AI diagnostic models before they can be used in primary care.
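The Top-1, Top-3 and Top-5 figures in the abstract above follow the usual Top-k convention: a case counts as correct if the true diagnosis appears anywhere among the first k ranked suggestions. A minimal sketch, with hypothetical diagnoses and a hypothetical `top_k_accuracy` helper that is not the study's code:

```python
def top_k_accuracy(ranked_predictions, truths, k):
    """Fraction of cases whose true label is among the first k ranked predictions."""
    hits = sum(
        truth in preds[:k]
        for preds, truth in zip(ranked_predictions, truths)
    )
    return hits / len(truths)

# Hypothetical ranked model outputs for three patients (illustrative only).
ranked = [
    ["nevus", "melanoma", "wart"],
    ["eczema", "psoriasis", "tinea"],
    ["wart", "nevus", "acne"],
]
truths = ["nevus", "tinea", "acne"]

top1 = top_k_accuracy(ranked, truths, k=1)  # only the first patient is a Top-1 hit
top3 = top_k_accuracy(ranked, truths, k=3)  # all three diagnoses appear in the Top-3
```

Because Top-k accuracy can only increase with k, comparing a model's Top-5 with a clinician's Top-3, as the abstract does, is not a like-for-like comparison.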
- Research Article
1
- 10.1093/humrep/deac105.025
- Jun 29, 2022
- Human Reproduction
Study question: Can a machine learning (ML) model, developed using modern neural network architecture produce comparable annotation data; utilisable for algorithmic outcome prediction, to manual time-lapse annotations? Summary answer: The model automatically annotated unseen embryos with comparable results to manual methods, generating morphokinetic data to enable comparably predictive outputs from an embryo selection algorithm. What is known already: The application of artificial intelligence across healthcare industries, including fertility, is increasing. Several ML models are available that seek to generate or analyse embryo images and morphokinetic data, and to determine embryo viability potential. Along with photographic images, the use of time-lapse in IVF laboratories has amassed numeric data, resulting predominantly from annotated manual assessment of images over time. Embryo annotation practice is variable in quality, can be subjective and is time-consuming; commonly taking several minutes per embryo. The development of rapid, accurate automatic annotation would represent a significant time-saving as well as an increase in reproducibility and accuracy. Study design, size, duration: Multicentre quality assured annotation data from 63,383 time-lapse monitored embryos (EmbryoScope®), comprising over 400 million individual images, were used to train a ML model to automatically generate morphokinetic annotations. Data was derived from 8 UK clinics within a cohesive group between 2012-2021. Accuracy was assessed using 900 unseen embryos (with live birth outcome) by comparing the output of an established in-house, prospectively validated embryo selection model when the input was either ML-automated, or manual annotations. Participants/materials, setting, methods: Multi-focal plane images were processed on the Azure cloud (Microsoft) and resampled to 300x300 pixels. A Laplacian-based focal stacking algorithm merged frames into a single image. 
The model consisted of an EfficientNetB4 Convolutional Neural Network classifier to extract features and classify the stage of embryo images. A Temporal Convolutional Network interpreted a time-series of image features; producing annotations from pronuclear fading through to blastocyst. Soft localisation loss function used QA data to integrate annotation subjectivities. Main results and the role of chance: The ML model rapidly and automatically generated annotations. Efficacy and comparability of the ML model to automate reliable, utilisable annotations was demonstrated by comparison with manual annotation data and the ML model’s ability to auto-generate annotations which could be used to predict live birth by providing annotation data to an established, validated in house embryo selection model. Live birth-predictive capability was measured, and benchmarked against manual annotation, using the area under the receiver operating characteristic curve (AUC). When tested on time-lapse images, collected from pronuclear fading to full blastulation, representing 900 previously unseen, transferred blastocysts where live birth outcomes were blinded, the in-house developed auto-annotation ML model resulted in an AUC of 0.686 compared with 0.661 for manual annotations, for live birth prediction. Auto annotation using the developed model took only milliseconds to complete per embryo. The developed auto-annotation model, built and tested on large data, is considered suitable for productionisation with the aim of being validated and integrated into an application to support IVF laboratory practice. Limitations, reasons for caution: Whilst this model was trained to recognise key morphokinetic events, there are other morphokinetic variables that may be useful in the prediction of live birth and further improve embryo selection, or deselection, ability. Akin to manual interpretation, some embryos may fail to be annotated or need second opinion. 
Wider implications of the findings: There is increasing evidence supporting the application of ML to utilise big data from time-lapse imaging and fertility care generally. Whilst promising benefits to IVF clinics and patients, responsible use of data is required alongside large high-quality datasets, and rigorous validation, to ensure safe and robust applications. Trial registration number: N/A