Abstract

Biases have been identified in historical expendable bathythermograph (XBT) datasets, and they are one of the major sources of uncertainty in the ocean subsurface database. More than 10 correction schemes have been proposed; however, their performance has not been collectively evaluated and compared. This study quantifies how well 10 of the available schemes correct the historical XBT data by comparing the corrected XBT data with collocated reference data in both the World Ocean Database (WOD) 2013 and the EN4 dataset. Four different metrics are proposed to quantify their performance. The results indicate that CH14 performs best among the currently available methods, and that L09, G12, and GR10 can be used with some caveats. To test the robustness of the schemes, we further train CH14 and L09 on 50% of the XBT–reference data and test them on the remaining data. The results indicate that the two schemes are robust. Moreover, the EN4 and WOD comparison datasets show a systematic difference in XBT error (~0.01°C when averaged globally over 0–700 m). The influences of quality control and data processing have been investigated. Additionally, a side-by-side XBT–CTD comparison experiment is used to examine the correction schemes and provides independent high-quality data for the assessment. The schemes that best correct the global datasets do not always perform as well at correcting the side-by-side dataset, and further examination of this discrepancy in performance is still required. Finally, CH14 and L09 result in very similar estimates of ocean heat content (OHC) change in the upper 700 m since 1966, suggesting the potential for reducing XBT-induced error in OHC estimates.