The development of valuable artificial intelligence (AI) tools to assist with ultrasound diagnosis depends on algorithms built from high-quality data. This study aimed to test the intra- and interobserver agreement of a proposed image-quality scoring system for quantifying the quality of gynecological transvaginal ultrasound (TVS) images, which could be used in clinical practice and in AI tool development.

A scoring system to quantify TVS image quality was devised following a review of the literature. Each rater assigned individual ultrasound images a score of 1-4, where 2 = poor, 3 = suboptimal and 4 = optimal image quality; images deemed inaccurate were assigned a score of 1, corresponding to 'reject'. Six professionals, comprising two radiologists, two sonographers and two sonologists, reviewed 150 images (50 of the uterus and 100 of the ovaries) obtained from 50 women, assigning each image a score of 1-4. Each rater repeated the review of all images after an interval of at least 1 week. Mean scores were calculated for each rater. Overall interobserver agreement was assessed using the intraclass correlation coefficient (ICC), while interobserver agreement between paired professionals and intraobserver agreement for all professionals were assessed using weighted Cohen's kappa and the ICC.

Interobserver agreement among the six raters was poor for all 150 images (ICC, 0.480 (95% CI, 0.363-0.586)) and for the uterine images alone (ICC, 0.359 (95% CI, 0.204-0.523)), while moderate agreement was achieved for the ovarian images (ICC, 0.531 (95% CI, 0.417-0.636)). Agreement between the paired sonographers and between the paired sonologists was poor for all images (ICC, 0.336 (95% CI, -0.078 to 0.619) and 0.425 (95% CI, 0.014-0.665), respectively), as well as for the uterine images (ICC, 0.253 (95% CI, -0.097 to 0.577) and 0.299 (95% CI, -0.094 to 0.606), respectively) and the ovarian images (ICC, 0.400 (95% CI, -0.043 to 0.669) and 0.469 (95% CI, 0.088-0.689), respectively). Agreement between the paired radiologists was moderate for all images (ICC, 0.600 (95% CI, 0.487-0.693)), for the uterine images (ICC, 0.538 (95% CI, 0.311-0.707)) and for the ovarian images (ICC, 0.621 (95% CI, 0.483-0.728)). Intraobserver agreement was weak to moderate for each rater, with weighted Cohen's kappa ranging from 0.533 to 0.718 for all images and from 0.467 to 0.751 for the ovarian images; correspondingly, the ICC indicated moderate-to-good intraobserver agreement for all images (range, 0.636-0.825) and for the ovarian images (range, 0.596-0.862). Intraobserver agreement was slightly better for the uterine images, with weighted Cohen's kappa ranging from 0.568 to 0.808, indicating weak-to-strong agreement, and ICC ranging from 0.546 to 0.893, indicating moderate-to-good agreement. All measures were statistically significant (P < 0.001).

The proposed image-quality scoring system showed poor-to-moderate interobserver agreement and mostly weak-to-moderate intraobserver agreement. Further refinement of the scoring system may be needed to improve agreement, although it remains unclear whether quantification of image quality can be achieved, given the highly subjective nature of ultrasound interpretation. Although some AI systems can tolerate labeling noise, most favor clean (high-quality) data. As such, innovative data-labeling strategies are needed.

© 2025 The Author(s). Ultrasound in Obstetrics & Gynecology published by John Wiley & Sons Ltd on behalf of International Society of Ultrasound in Obstetrics and Gynecology.
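For illustration, the sketch below shows how the agreement statistics named in the abstract (weighted Cohen's kappa for a pair of raters or reading rounds, and the ICC across all raters) could be computed for a 1-4 ordinal quality scale. This is a minimal example, not the study's analysis: the ratings are randomly generated placeholders, scikit-learn and pingouin are assumed as tooling, and the quadratic kappa weights and the choice among the ICC models reported by pingouin are assumptions, since the abstract does not specify either.

```python
# Minimal sketch of the agreement analysis described in the abstract.
# All data here are synthetic placeholders; library choices (scikit-learn,
# pingouin), quadratic kappa weights and the ICC model are assumptions.
import numpy as np
import pandas as pd
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
n_images, n_raters = 150, 6

# Hypothetical scores on the 1-4 scale (1 = reject, 2 = poor,
# 3 = suboptimal, 4 = optimal).
scores = rng.integers(1, 5, size=(n_images, n_raters))

# Interobserver agreement for one pair of raters: weighted Cohen's kappa.
kappa_pair = cohen_kappa_score(scores[:, 0], scores[:, 1], weights="quadratic")

# Intraobserver agreement: same rater, two reading rounds (the second round
# is simulated here by perturbing the first).
round2 = np.clip(scores[:, 0] + rng.integers(-1, 2, size=n_images), 1, 4)
kappa_intra = cohen_kappa_score(scores[:, 0], round2, weights="quadratic")

# Overall interobserver agreement across all six raters: ICC computed on
# long-format data (one row per image-rater pair).
long = pd.DataFrame({
    "image": np.repeat(np.arange(n_images), n_raters),
    "rater": np.tile(np.arange(n_raters), n_images),
    "score": scores.ravel(),
})
icc = pg.intraclass_corr(data=long, targets="image", raters="rater",
                         ratings="score")

print(f"Weighted kappa, raters 1 vs 2: {kappa_pair:.3f}")
print(f"Weighted kappa, rater 1 round 1 vs round 2: {kappa_intra:.3f}")
print(icc[["Type", "ICC", "CI95%"]])
```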