There is an increasing research focus on the role of machine learning in the haematology laboratory, particularly in blood cell morphologic assessment. Human-level performance is an important baseline and goal for machine learning. This study aims to assess interobserver variability and human-level performance in blood cell morphologic assessment. A dataset of 1000 single white blood cell images was independently labelled by 10 doctors and morphology scientists. Interobserver variability was calculated using Fleiss' kappa. Observers' labels were then separated into consensus labels, used to determine ground truth, and performance labels, used to assess observer performance. A machine learning model was trained and assessed using the same cell images, and explainability images (XRAI and integrated gradients [IG]) were generated for each of the test images. Fleiss' kappa for all 10 observers was 0.608, indicating substantial agreement between observers. The accuracy of the human observers was 95%, with sensitivity 72% and specificity 97%; the accuracy of the machine learning model was 95%, with sensitivity 71% and specificity 97%. The model's performance across individual labels was similar to that of the human observers. Explainability maps demonstrated that the machine learning model differentiated between the cytoplasm and nucleus of the cells and used these features to make its predictions. The substantial, though not perfect, agreement between human observers highlights the inherent subjectivity of white blood cell morphologic assessment. A machine learning model performed similarly to human observers in single white blood cell identification. Further research is needed to compare human-level and machine learning performance in ways that more closely reflect the typical process of morphologic assessment.
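For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of how Fleiss' kappa is computed for a multi-rater labelling task of this kind. The function name and the toy count matrix are illustrative assumptions for exposition only; they are not the study's code or data.

```python
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """Fleiss' kappa for a (n_subjects, n_categories) count matrix,
    where ratings[i, j] is the number of raters who assigned subject i
    to category j. Every row must sum to the same number of raters."""
    n = ratings.sum(axis=1)[0]             # raters per subject (constant)
    N = ratings.shape[0]                   # number of subjects
    # Per-subject agreement: fraction of rater pairs agreeing on subject i.
    P_i = (np.sum(ratings ** 2, axis=1) - n) / (n * (n - 1))
    P_bar = P_i.mean()                     # mean observed agreement
    p_j = ratings.sum(axis=0) / (N * n)    # marginal category proportions
    P_e = np.sum(p_j ** 2)                 # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical example: 5 cells, 10 raters, 3 morphologic categories.
counts = np.array([
    [10, 0, 0],
    [ 8, 2, 0],
    [ 0, 9, 1],
    [ 3, 3, 4],
    [ 0, 1, 9],
])
print(f"Fleiss' kappa: {fleiss_kappa(counts):.3f}")
```

On the conventional interpretation scale, values between 0.61 and 0.80 are read as "substantial" agreement, which is how the study characterises its observed kappa of 0.608.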