How subjective CT image quality assessment becomes surprisingly reliable: pairwise comparisons instead of Likert scale.

Casper Mihl,Patricia J Nelemans,Razvan L Miclea,Babs M F Hendriks,Frank M Zijta,Walter H Backes,Joachim E Wildberger,Hester A Gietema,Cécile R L P N Jeukens,Eva J I Hoeijmakers,Bibi Martens

doi:10.1007/s00330-023-10493-7

Casper Mihl, Patricia J Nelemans + Show 9 more

Open Access

https://doi.org/10.1007/s00330-023-10493-7

Copy DOI

Abstract

The aim of this study is to improve the reliability of subjective IQ assessment using a pairwise comparison (PC) method instead of a Likert scale method in abdominal CT scans. Abdominal CT scans (single-center) were retrospectively selected between September 2019 and February 2020 in a prior study. Sample variance in IQ was obtained by adding artificial noise using dedicated reconstruction software, including reconstructions with filtered backprojection and varying iterative reconstruction strengths. Two datasets (each n = 50) were composed with either higher or lower IQ variation with the 25 original scans being part of both datasets. Using in-house developed software, six observers (five radiologists, one resident) rated both datasets via both the PC method (forcing observers to choose preferred scans out of pairs of scans resulting in a ranking) and a 5-point Likert scale. The PC method was optimized using a sorting algorithm to minimize necessary comparisons. The inter- and intraobserver agreements were assessed for both methods with the intraclass correlation coefficient (ICC). Twenty-five patients (mean age 61 years ± 15.5; 56% men) were evaluated. The ICC for interobserver agreement for the high-variation dataset increased from 0.665 (95%CI 0.396-0.814) to 0.785 (95%CI 0.676-0.867) when the PC method was used instead of a Likert scale. For the low-variation dataset, the ICC increased from 0.276 (95%CI 0.034-0.500) to 0.562 (95%CI 0.337-0.729). Intraobserver agreement increased for four out of six observers. The PC method is more reliable for subjective IQ assessment indicated by improved inter- and intraobserver agreement. This study shows that the pairwise comparison method is a more reliable method for subjective image quality assessment. Improved reliability is of key importance for optimization studies, validation of automatic image quality assessment algorithms, and training of AI algorithms. • Subjective assessment of diagnostic image quality via Likert scale has limited reliability. • A pairwise comparison method improves the inter- and intraobserver agreement. • The pairwise comparison method is more reliable for CT optimization studies.

Full Text