Inter-observer Variability Analysis of Automatic Lung Delineation in Normal and Disease Patients.

Luca Saba, Chue R Ng, Ashari Yunus, Omar M Rijal, Rosminah M Kassim, Jasjit S Suri, Joel C M Than, Norliza M Noor

doi:10.1007/s10916-016-0504-7

Abstract

Human interaction has become almost mandatory for an automated medical system that seeks acceptance by clinical regulatory agencies such as the Food and Drug Administration. Since this interaction introduces variability into the gathered data, the inter-observer and intra-observer variability must be analyzed in order to validate the accuracy of the system. This study focuses on the variability among different observers who interact with an automated lung delineation system that relies on human interaction in the form of delineation of the lung borders. The database consists of High Resolution Computed Tomography (HRCT) images of 15 normal and 81 diseased patients, taken retrospectively at five levels per patient. Three observers independently and manually delineated the lung boundaries at all five levels of the 3-D lung volume using software called ImgTracer™ (AtheroPoint™, Roseville, CA, USA). Observer-1 is a less experienced, novice tracer who is a resident in radiology working under the guidance of a radiologist, whereas Observer-2 and Observer-3 are lung image scientists trained by a lung radiologist and by a biomedical imaging scientist and experts. The inter-observer variability can be shown by comparing each observer's tracings to the automated delineation and also by comparing the observers' manual tracings with one another. The normality of the tracings was tested using the D'Agostino-Pearson test, and all observers' tracings yielded a P-value higher than 0.05, consistent with a normal distribution. The analysis of variance (ANOVA) test between the three observers and the automated system showed P-values higher than 0.89 and 0.81 for the right lung (RL) and left lung (LL), respectively. The performance of the automated system was evaluated using the Dice Similarity Coefficient (DSC), Jaccard Index (JI) and Hausdorff Distance (HD) measures.
Although Observer-1 has less experience than Observer-2 and Observer-3, the Observer Deterioration Factor (ODF) shows that Observer-1 differs from the other two by less than 10%, which is within the acceptable range as per our analysis. To compare between observers, this study used regression plots, Bland-Altman plots, and the two-tailed t-test, Mann-Whitney and Chi-squared tests, which gave the following P-values: (i) Observer-1 and Observer-3: 0.55, 0.48, 0.29 for RL and 0.55, 0.59, 0.29 for LL; (ii) Observer-1 and Observer-2: 0.57, 0.50, 0.29 for RL and 0.54, 0.59, 0.29 for LL; (iii) Observer-2 and Observer-3: 0.98, 0.99, 0.29 for RL and 0.99, 0.99, 0.29 for LL. Further, the correlation coefficient (CC) and R-squared coefficient computed between observers came out to be 0.9 for both RL and LL. Nevertheless, the tracings of all three observers show that diseased lungs are smaller than normal lungs in terms of area.
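
The three overlap and distance measures named in the abstract (DSC, JI and HD) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each delineation is represented as a set of (x, y) pixel coordinates, uses the standard textbook definitions, and the function names and the toy 4x4 regions are hypothetical. In practice the Hausdorff distance is often computed on boundary points only; here it is computed on whatever point sets are passed in.

```python
from math import hypot

def dice(a, b):
    """Dice Similarity Coefficient: 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty regions agree perfectly
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    """Jaccard Index: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # same convention as above
    return len(a & b) / len(union)

def hausdorff(p, q):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(src, dst):
        # Largest distance from any point in src to its nearest point in dst.
        return max(
            min(hypot(x1 - x2, y1 - y2) for x2, y2 in dst)
            for x1, y1 in src
        )
    return max(directed(p, q), directed(q, p))

# Toy example (hypothetical data): a 4x4 "manual" region and an
# "automated" region of the same size shifted one pixel to the right.
manual = {(x, y) for x in range(0, 4) for y in range(4)}
auto = {(x, y) for x in range(1, 5) for y in range(4)}

print(dice(manual, auto))       # overlap of 12 pixels out of 16 + 16
print(jaccard(manual, auto))    # overlap of 12 pixels out of a union of 20
print(hausdorff(manual, auto))  # worst-case nearest-neighbour distance
```

DSC and JI reward area overlap, while HD reports the worst local disagreement between the two outlines, so the three measures are complementary rather than interchangeable.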

Full Text