Introduction
Metastases affecting the femur increase the risk of fracture. Clinicians therefore need to know whether a patient's femur can withstand the loads of daily activities. The tools currently used in clinics are not sufficiently precise. A newer method, CT-scan-based finite element analysis, gives good predictive results. However, none of the existing models has been tested for reproducibility. This is a critical issue to address before the technique can be applied to large cohorts worldwide to evaluate metastatic bone fracture risk in patients. The aim of this study is therefore to evaluate, for one of the most promising models in the literature (the original model): 1) its reproducibility, 2) the transposition of the reproduced model to another dataset, and 3) its global sensitivity.

Methods
The model was reproduced from the paper describing it, with input from its authors to avoid reproduction errors. Reproducibility was evaluated by comparing the results of the original model, run by the original team (Leuven, Belgium), with those of the reproduced model, run by another team (Lyon, France), on the same dataset of CT-scans of ex vivo femurs. Transposition was evaluated by comparing the results of the reproduced model on two different datasets. The global sensitivity analysis used the Morris method and assessed the influence of the density calibration coefficient, the segmentation, and the orientation and length of the femur.

Results
The original and reproduced models are highly correlated (r² = 0.95), although the reproduced model systematically yields higher failure loads. When the reproduced model is applied to another dataset, its predictions are less accurate (r² with the experimental failure load decreases and errors increase). The global sensitivity analysis showed a strong influence of the density calibration coefficient (mean failure-load variation of 84 %) and a non-negligible influence of the segmentation, orientation, and length of the femur (mean failure-load variations between 7 and 10 %).

Conclusion
This study showed that, although validated, the reproduced model underperformed on another dataset. A performance difference between datasets is commonly caused by overfitting during model creation. However, the dataset used in the original paper (Sas et al., 2020a) and the Leuven dataset gave similar performance, which makes overfitting a less likely cause. The model is also highly sensitive to the density parameters, and automating the measurements may reduce the uncertainty on the failure load. An uncertainty propagation analysis, planned as future work, would give the actual precision of such a model and improve our understanding of its behavior.
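To illustrate the reproducibility comparison described in Methods, the sketch below regresses the failure loads of the reproduced model on those of the original model and reports r² together with the mean systematic offset. The arrays and variable names are hypothetical placeholders, not the study's data; only the comparison procedure follows the text.

```python
# Minimal sketch of the model-vs-model reproducibility check.
# Hypothetical failure loads (N) for the same set of ex vivo femurs.
import numpy as np
from scipy import stats

f_original = np.array([3200.0, 4100.0, 2800.0, 5600.0, 4900.0])
f_reproduced = np.array([3450.0, 4380.0, 3050.0, 5900.0, 5200.0])

fit = stats.linregress(f_original, f_reproduced)
r_squared = fit.rvalue ** 2
mean_offset = np.mean(f_reproduced - f_original)

print(f"r^2 = {r_squared:.2f}")              # agreement between implementations
print(f"mean offset = {mean_offset:.0f} N")  # positive => reproduced model predicts higher loads
```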
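The Morris screening described in Methods could be set up along the following lines with the SALib library. The parameter names, their ranges, and the toy failure-load function are illustrative assumptions standing in for the actual CT-based finite element pipeline; only the choice of the Morris elementary-effects method comes from the text.

```python
# Sketch of a Morris (elementary effects) screening with SALib.
# The four inputs mirror those named in the abstract; bounds and the
# toy failure-load function are assumptions, not the study's FE model.
import numpy as np
from SALib.sample.morris import sample as morris_sample
from SALib.analyze import morris as morris_analyze

problem = {
    "num_vars": 4,
    "names": ["density_calibration", "segmentation_offset_mm",
              "orientation_deg", "femur_length_ratio"],
    "bounds": [[0.9, 1.1],    # scaling on the HU-to-density calibration coefficient
               [-0.5, 0.5],   # surface offset mimicking segmentation variability
               [-5.0, 5.0],   # rotation of the femur in the loading frame
               [0.9, 1.1]],   # relative length of the analysed femur segment
}

def failure_load(x):
    """Placeholder for the CT-scan-based finite element model (toy function)."""
    dens, seg, orient, length = x
    return 4000.0 * dens**2 + 80.0 * seg - 15.0 * abs(orient) + 500.0 * length

X = morris_sample(problem, N=50, num_levels=4)   # Morris trajectories over the input space
Y = np.apply_along_axis(failure_load, 1, X)      # one failure load per sampled input set
res = morris_analyze.analyze(problem, X, Y, num_levels=4)

for name, mu_star in zip(problem["names"], res["mu_star"]):
    print(f"{name}: mu* = {mu_star:.1f} N")      # mean absolute elementary effect
```

In such a screening, a large mu* for density_calibration relative to the other inputs would correspond to the dominant influence of the density calibration coefficient reported in Results.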