Defining poor prognosis markers of implantation for embryo selection by time-lapse

Abstract
Study question: Do AI models for embryo selection provide actual implantation probabilities that generalise across clinics and patient demographics?
Summary answer: AI models need to be calibrated on representative data before their predicted scores agree reasonably well with actual implantation probabilities.
What is known already: AI models have been shown to perform well at discriminating embryos according to implantation likelihood, as measured by the area under the curve (AUC). However, discrimination performance does not indicate how well models predict the actual implantation likelihood, especially across clinics and patient demographics. In general, prediction models must be calibrated on representative data to provide meaningful probabilities. Calibration can be evaluated and summarised by the “expected calibration error” (ECE) on score deciles and tested for significant lack of calibration using the Hosmer-Lemeshow goodness-of-fit test. ECE describes the average deviation between predicted probabilities and observed implantation rates and is 0 for perfect calibration.
Study design, size, duration: Time-lapse embryo videos from 18 clinics were used to develop AI models for prediction of fetal heartbeat (FHB). Model generalisation was evaluated on clinic hold-out models for the three largest clinics. Calibration curves were used to evaluate the agreement between AI-predicted scores and observed FHB outcome and were summarised by ECE. Models were evaluated 1) without calibration, 2) with calibration (Platt scaling) on other clinics’ data, and 3) with calibration on the clinic’s own data (30%/70% for calibration/evaluation).
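As an illustration of the ECE described above, the following minimal Python sketch computes it on score deciles. The function name, the equal-count binning, and the weighting of each bin by its size are assumptions made for illustration; the abstract does not specify these details.

```python
def expected_calibration_error(scores, outcomes, n_bins=10):
    """ECE over equal-count score bins (deciles when n_bins=10).

    scores   -- predicted probabilities in [0, 1]
    outcomes -- observed binary outcomes (1 = FHB, 0 = no FHB)
    Assumption: bins are weighted by their share of the embryos.
    """
    # Sort by score only, so that ties are not ordered by outcome.
    pairs = sorted(zip(scores, outcomes), key=lambda t: t[0])
    n = len(pairs)
    ece = 0.0
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # can happen when there are fewer embryos than bins
        mean_pred = sum(s for s, _ in chunk) / len(chunk)
        obs_rate = sum(y for _, y in chunk) / len(chunk)
        ece += len(chunk) / n * abs(mean_pred - obs_rate)
    return ece
```

A perfectly calibrated set of scores gives 0, while a model that predicts 0.5 for embryos that never implant gives 0.5.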
Participants/materials, setting, methods: A previously described AI algorithm, iDAScore, based on 115,842 time-lapse sequences of embryos, including 14,644 transferred embryos with known implantation data (KID), was used as the foundation for training hold-out AI models for the three largest clinics (n = 2,829, 2,673 and 1,327 KID embryos), such that each clinic's data were not included during model training. ECEs across the three clinics (mean±SD) were compared for models with and without calibration using KID embryos only, both overall and within subgroups of patient age (<36, 36-40, >40 years).
Main results and the role of chance: The AUC across the three clinics was 0.675±0.041 (mean±SD) and was unaffected by calibration. Without calibration, the overall ECE was 0.223±0.057, indicating weak agreement between scores and actual implantation rates. With calibration on other clinics’ data, the overall ECE was 0.040±0.013, indicating considerable improvement with moderate clinical variation. As implantation probabilities are affected by both clinical practice and patient demographics, subgroup analysis was conducted on patient age (<36, 36-40, >40 years). With calibration on other clinics’ data, age-group ECEs were 0.129±0.055 vs. 0.078±0.033 vs. 0.072±0.015. These calibration errors were thus larger than the overall average ECE of 0.040, indicating poor generalisation across age. When age was included as input to the calibration, age-group ECEs were 0.088±0.042 vs. 0.075±0.046 vs. 0.051±0.025, indicating improved agreement between scores and implantation rates across both clinics and age groups. The best calibration, however, was obtained when calibrating with age on the clinic’s own data, with ECEs of 0.060±0.017 vs. 0.040±0.010 vs. 0.039±0.009. The results indicate that both clinical practice and patient demographics influence calibration and thus ideally should be adjusted for.
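Platt scaling fits a logistic (sigmoid) function that maps a model's raw score to a probability. The sketch below shows one way such a calibration could accept age as an additional input. It is not the authors' implementation: the function names, the feature layout (`[score]` or `[score, scaled_age]`), and the plain gradient-descent fit are assumptions chosen to keep the example dependency-free.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(features, outcomes, lr=0.1, n_iter=5000):
    """Fit p = sigmoid(w . x + b) by gradient descent on the log-loss.

    features -- list of rows, e.g. [score] or [score, scaled_age]
                (hypothetical layout; inputs should be roughly unit scale)
    outcomes -- list of 0/1 labels (1 = FHB)
    """
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, outcomes):
            # Derivative of the log-loss with respect to the linear term.
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def calibrated_probability(w, b, x):
    """Apply the fitted calibration to one feature row."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

In a setup like the one described, the parameters would be fitted on the calibration split (other clinics' data, or 30% of the clinic's own data) and applied to the held-out evaluation split.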
When tested for lack of calibration with the Hosmer-Lemeshow goodness-of-fit test, only one age group from one clinic appeared miscalibrated (P = 0.02), whereas all other age groups from the three clinics were appropriately calibrated (P > 0.10).
Limitations, reasons for caution: In this study, AI model calibration was conducted based on clinic and age. Other patient metadata such as BMI and patient diagnosis may be relevant for calibration as well. However, both calibration and evaluation on the clinic’s own data require a substantial amount of data for each subgroup.
Wider implications of the findings: With calibrated scores, AI models can predict the actual implantation likelihood for each embryo. Probability estimates are a strong tool for patient communication and for clinical decisions such as when to discard or freeze embryos. Model calibration may thus be the next step in improving clinical outcome and shortening time to live birth.
Trial registration number: This work is partly funded by the Innovation Fund Denmark (IFD) under File No. 7039-00068B and partly funded by Vitrolife A/S.
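The Hosmer-Lemeshow goodness-of-fit test mentioned in the main results compares observed and expected event counts within groups of sorted predictions. The sketch below is a minimal, dependency-free version under stated assumptions: equal-count groups, degrees of freedom equal to the number of groups minus 2 (a common convention, not confirmed by the abstract), and a closed-form chi-square tail that is valid only for even degrees of freedom. All function names are hypothetical.

```python
import math


def chi2_sf_even_df(x, df):
    """Chi-square survival function, closed form for even df only."""
    assert df > 0 and df % 2 == 0, "closed form requires even df"
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)^i / i!
        total += term
    return math.exp(-half) * total


def hosmer_lemeshow(probs, outcomes, n_groups=10):
    """Return (statistic, p_value) for predicted probabilities vs. 0/1 outcomes.

    Assumption: df = n_groups - 2, so n_groups must be even and > 2 here.
    """
    # Sort by predicted probability only, so ties keep their input order.
    pairs = sorted(zip(probs, outcomes), key=lambda t: t[0])
    n = len(pairs)
    stat = 0.0
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        n_g = len(chunk)
        if n_g == 0:
            continue
        expected = sum(p for p, _ in chunk)  # expected number of events
        observed = sum(y for _, y in chunk)  # observed number of events
        denom = expected * (1.0 - expected / n_g)
        if denom > 0:
            stat += (observed - expected) ** 2 / denom
    return stat, chi2_sf_even_df(stat, n_groups - 2)
```

A small P value indicates a significant lack of calibration, so a well-calibrated subgroup is expected to yield a P value above the chosen threshold.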
