Fully annotated data sets play important roles in medical image segmentation and evaluation. Expense and imprecision are the two main issues in generating ground truth (GT) segmentations. In this paper, in an attempt to overcome these two issues jointly, we propose a method, named SparseGT, which exploit variability among human segmenters to maximally save manual workload in GT generation for evaluating actual segmentations by algorithms. Pseudo ground truth (p-GT) segmentations are created by only a small fraction of workload and with human-level perfection/imperfection, and they can be used in practice as a substitute for fully manual GT in evaluating segmentation algorithms at the same precision. p-GT segmentations are generated by first selecting slices sparsely, where manual contouring is conducted only on these sparse slices, and subsequently filling segmentations on other slices automatically. By creating p-GT with different levels of sparseness, we determine the largest workload reduction achievable for each considered object, where the variability of the generated p-GT is statistically indistinguishable from inter-segmenter differences in full manual GT segmentations for that object. Furthermore, we investigate the segmentation evaluation errors introduced by variability in manual GT by applying p-GT in evaluation of actual segmentations by an algorithm. Experiments are conducted on ∼500 computed tomography (CT) studies involving six objects in two body regions, Head & Neck and Thorax, where optimal sparseness and corresponding evaluation errors are determined for each object and each strategy. Our results indicate that creating p-GT by the concatenated strategy of uniformly selecting sparse slices and filling segmentations via deep-learning (DL) network show highest manual workload reduction by ∼80-96% without sacrificing evaluation accuracy compared to fully manual GT. Nevertheless, other strategies also have obvious contributions in different situations. A non-uniform strategy for slice selection shows its advantage for objects with irregular shape change from slice to slice. An interpolation strategy for filling segmentations can achieve ∼60-90% of workload reduction in simulating human-level GT without the need of an actual training stage and shows potential in enlarging data sets for training p-GT generation networks. We conclude that not only over 90% reduction in workload is feasible without sacrificing evaluation accuracy but also the suitable strategy and the optimal sparseness level achievable for creating p-GT are object- and application-specific.
Read full abstract