Abstract
e12571

Background: Around 15% of breast carcinomas exhibit human epidermal growth factor receptor 2 (HER2) overexpression due to gene amplification. HER2 protein levels are assessed by immunohistochemistry (IHC) using four scores (0, 1+, 2+, 3+). Only IHC 3+ cases and IHC 2+ cases with proven HER2 gene amplification are eligible for targeted anti-HER2 therapies; all others are considered HER2-negative. However, non-amplified tumors scored 1+ or 2+ could benefit from the new HER2-targeted antibody-drug conjugates, which makes accurate HER2 scoring essential, yet this accuracy is challenged by interobserver variability. This variability also complicates the assessment of AI-based HER2 scoring; it is therefore important to understand it and to propose a probabilistic framework that quantifies and explains discrepancies between AI analyses and pathologist scores.

Methods: We introduce a novel probabilistic framework designed to assign probabilities to the scores generated by AI algorithms. To the best of our knowledge, this is the first effort to quantify the uncertainty of AI predictive models for HER2 scoring. The methodology relies on a high-dimensional multivariate Gaussian approximation of the HER2 scoring process, which builds on the assignment of each stained cell to a category based on membrane completeness and staining intensity. Introducing uncertainty into this categorical assignment enables the quantification of uncertainty in the overall AI score at the whole-slide image (WSI) level. Consequently, instead of a single score, we can assign a probability of belonging to each category (0, 1+, 2+, or 3+) for the global score. This approach enables a comparison of the predicted likelihood with the distribution of scores among pathologists, rather than relying on a "hypothetical" ground truth, often determined by a majority vote that is noise-sensitive or excludes ambiguous cases when agreement is hard to reach.

Results: For illustration, we apply the framework to the Kwant HER2 algorithm developed by DiaDeep on a validation set of 68 HER2 cases previously scored by three expert pathologists. The framework helps explain discrepancies between pathologists by adding nuance to the AI scoring, and it highlights narrow probability margins in borderline cases, illustrating the inherent difficulty for pathologists of drawing a clear demarcation between adjacent score categories.

Conclusions: The proposed framework underscores the intrinsic interobserver variability of HER2 scoring and allows AI predictions for HER2 scores to be analyzed with more nuance, enabling a more critical reading of the results than a single raw output. As we enter a new era of AI-assisted diagnosis, such a method could improve how new AI tools are assessed and help interpret their results.
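As a concrete illustration of the Methods described above, the minimal sketch below shows one way cell-level classification uncertainty can be propagated to a WSI-level score distribution: the sum of independent per-cell categorical indicators is approximated by a multivariate Gaussian with mean mu = sum_i p_i and covariance sum_i (diag(p_i) - p_i p_i^T), and sampled category counts are converted to proportions and mapped to a score. The four staining categories and the simplified ASCO/CAP-style thresholds are illustrative assumptions for this sketch; this is not the Kwant HER2 implementation.

```python
import numpy as np

# Illustrative staining categories an AI cell classifier might output
# (an assumption, not the Kwant HER2 algorithm's actual taxonomy):
# 0 = no staining, 1 = faint/incomplete membrane staining,
# 2 = weak-to-moderate complete, 3 = intense complete.
N_CATS = 4

def gaussian_count_approximation(cell_probs: np.ndarray):
    """Mean and covariance of the per-category cell counts.

    Each cell i contributes an independent categorical indicator with
    probabilities cell_probs[i]; their sum is approximately multivariate
    Gaussian with mean sum_i p_i and covariance sum_i (diag(p_i) - p_i p_i^T).
    """
    mu = cell_probs.sum(axis=0)                    # expected counts per category
    cov = np.diag(mu) - cell_probs.T @ cell_probs  # sum of diag(p_i) - p_i p_i^T
    return mu, cov

def her2_score(proportions: np.ndarray) -> int:
    """Simplified ASCO/CAP-style rule on category proportions (assumption)."""
    if proportions[3] > 0.10:                    # >10% intense complete staining
        return 3
    if proportions[2] + proportions[3] > 0.10:   # >10% weak/moderate complete
        return 2
    if proportions[1:].sum() > 0.10:             # >10% faint/incomplete
        return 1
    return 0

def score_distribution(cell_probs: np.ndarray, n_draws: int = 10_000,
                       rng=None) -> np.ndarray:
    """Probability of each WSI-level score under the Gaussian approximation."""
    rng = rng or np.random.default_rng(0)
    mu, cov = gaussian_count_approximation(cell_probs)
    n_cells = cell_probs.shape[0]
    draws = rng.multivariate_normal(mu, cov, size=n_draws).clip(min=0.0)
    scores = [her2_score(d / n_cells) for d in draws]
    return np.bincount(scores, minlength=N_CATS) / n_draws

# Toy usage: 1,000 cells whose classifier outputs sit near the 2+/3+ boundary.
rng = np.random.default_rng(42)
cell_probs = rng.dirichlet([6.0, 2.0, 1.0, 1.0], size=1000)
print(score_distribution(cell_probs))  # probabilities for scores 0, 1+, 2+, 3+
```

On a borderline slide such as the toy example above, the output spreads probability mass across adjacent categories (e.g., 2+ and 3+) instead of committing to one label, which is exactly the nuance the abstract argues should be compared against the distribution of pathologist scores.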