Abstract
Introduction: Prognostic models that use routine demographic and clinical data can aid clinicians in decision-making by estimating the risk of complications, including gestational diabetes mellitus (GDM). Because patient populations and measurement procedures are heterogeneous and change over time, these models must be continuously updated.
Objectives: This study aimed to follow best practices by temporally evaluating three GDM prediction models previously developed in the same region. This included performing an updated temporal validation of the Monash GDM Logistic Regression model ('version 2') and undertaking the first temporal validation of the Monash GDM Machine Learning model (CatBoost classifier; 'version 3 model 1') and an extended logistic regression GDM model ('version 3 model 2').
Methods: We used data from 12,722 singleton pregnancies at Monash Health Network from 2021 to 2022 for temporal evaluation of the models. The version 2 model included six categorical variables, while the two version 3 models shared the same eight variables, a mix of categorical and continuous. Missing data were handled through multiple imputation. Model performance was assessed in terms of discrimination and calibration. Decision curve analyses (DCA) were performed to determine the net benefit of each model. Recalibration was considered as a means of improving model performance. Subgroup performance (algorithmic fairness) was assessed across ethnic groups and parity.
Results: In this temporal evaluation dataset, GDM prevalence was 28.6%, compared to 18% when the version 2 model was developed and 21.3% when the version 3 models were developed. Among women with GDM, 33.5% were aged 35 or older, and 62.2% had a BMI > 25 kg/m². Discrimination was similar across the models, with areas under the curve (AUCs) of 0.72 [95% CI: 0.71, 0.73], 0.73 [95% CI: 0.72, 0.74], and 0.73 [95% CI: 0.73, 0.74] for version 2, version 3 model 1, and version 3 model 2, respectively. All models exhibited overestimation, with calibration slopes of 0.87, 0.992, and 0.87, respectively, which improved with recalibration. In DCA, all models had a higher net benefit than the alternative strategies across some threshold probabilities. For all models, prediction performance varied somewhat across ethnic groups and parity.
Conclusions: Despite a significant decrease in the discrimination of the machine learning model and some degradation in calibration, all models remained robust, especially after recalibration. Dynamic models are better suited to adapt to temporal changes in the baseline characteristics of pregnant women and the resulting calibration drift, as they can incorporate new data without requiring manual evaluation.
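The performance measures named in the Methods (discrimination, calibration slope, logistic recalibration, and net benefit) can be illustrated with a minimal sketch. This is not the study's code: the function names, the Newton-Raphson fitting routine, and the toy data are assumptions made purely for illustration, and the numbers bear no relation to the reported results.

```python
import math

def auc(y, p):
    """Rank-based AUC: share of (event, non-event) pairs ranked correctly; ties count 0.5."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def logit(p):
    return math.log(p / (1 - p))

def calibration_fit(y, p, iters=25):
    """Fit logit(P(y=1)) = a + b * logit(p) by Newton-Raphson.

    b is the calibration slope (b < 1 means predictions are too extreme);
    a is the intercept estimated jointly with the slope.
    """
    x = [logit(pi) for pi in p]
    a, b = 0.0, 1.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            mu = 1 / (1 + math.exp(-(a + b * xi)))
            w = mu * (1 - mu)
            g0 += yi - mu            # gradient w.r.t. a
            g1 += (yi - mu) * xi     # gradient w.r.t. b
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

def recalibrate(p, a, b):
    """Logistic recalibration: map an original prediction through the fitted a and b."""
    return 1 / (1 + math.exp(-(a + b * logit(p))))

def net_benefit(y, p, threshold):
    """Net benefit at one threshold probability, as used in decision curve analysis."""
    n = len(y)
    tp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 1)
    fp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)

# Hypothetical toy data: 10 women predicted at 0.2 (3 events), 10 predicted at 0.8 (7 events).
toy_p = [0.2] * 10 + [0.8] * 10
toy_y = [1] * 3 + [0] * 7 + [1] * 7 + [0] * 3

intercept, slope = calibration_fit(toy_y, toy_p)
print("AUC:", auc(toy_y, toy_p))
print("calibration slope:", slope)
print("recalibrated 0.2 ->", recalibrate(0.2, intercept, slope))
print("net benefit at 0.3:", net_benefit(toy_y, toy_p, 0.3))
```

In the toy data the predictions are more extreme than the observed event rates, so the fitted slope falls below 1 and recalibration pulls the predictions toward the observed proportions. A full decision curve would repeat `net_benefit` over a grid of thresholds and compare the result with treat-all and treat-none strategies.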