Abstract

Background
Ulcerative colitis is an idiopathic inflammatory disorder affecting the mucosa of the colon, with superficial erosion and ulcers associated with bleeding. Severity assessment using current scoring schemes such as UCEIS and MAYO relies on the subjective interpretation of the physician and fails to take into account the size, number and distribution of the lesions. Automatic lesion detection methods can enable fine-grained assessment of lesion severity, but require a training stage based on time-consuming manual annotation. Most methods currently use generic datasets that are biased towards capsule endoscopy and are not adapted to locally available hardware.

Methods
We learn automatic bleeding and ulcer detectors on a local dataset created by the Gastroenterology group at the Bichat and Beaujon hospitals. The patients' videos were anonymised and analysed after obtaining patient consent. The study was approved by the local research study committee. To minimise the expert annotation burden, only rectangular annotations are provided instead of a precise delineation of lesion boundaries (Figure 1). This leads to many mislabelled pixels, especially in the corners of the rectangles, and affects both the evaluation of the models' performance and our ability to find correct models. Standard sensitivity and specificity cannot be used effectively on this dataset. We propose to evaluate model sensitivity at the annotation level and keep specificity at the pixel level. On the training set, we consider that a model correctly identifies a lesion if it agrees with the expert on a subset of the annotation, and we count the detected annotations weighted by their area. For robustness and ease of interpretation, we explore the set of linear classifiers and propose an efficient sampling scheme that rejects trivial models. This method is evaluated on a database of 10 colonoscopy videos (5 training videos and 5 test videos).
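The evaluation scheme described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the box format `(row0, col0, row1, col1)` with exclusive end indices, and the `min_fraction` threshold that decides when a model "agrees with the expert on a subset of the annotation" are all assumptions made for illustration. The sketch uses plain Python lists so that it runs without dependencies.

```python
def annotation_sensitivity(pred, boxes, min_fraction=0.1):
    """Area-weighted fraction of annotated rectangles that are detected.

    pred  : 2-D list of booleans (True = pixel predicted as lesion).
    boxes : list of (row0, col0, row1, col1) rectangles, end-exclusive
            (hypothetical format, not taken from the paper).
    min_fraction : assumed threshold; a rectangle counts as detected when
            at least this share of its pixels is predicted positive.
    """
    detected_area = 0
    total_area = 0
    for r0, c0, r1, c1 in boxes:
        area = (r1 - r0) * (c1 - c0)
        hits = sum(pred[r][c] for r in range(r0, r1) for c in range(c0, c1))
        total_area += area
        if hits > 0 and hits >= min_fraction * area:
            detected_area += area  # weight each detected annotation by its area
    return detected_area / total_area if total_area else float("nan")


def pixel_specificity(pred, boxes):
    """Fraction of pixels outside every rectangle that are predicted negative."""
    inside = set()
    for r0, c0, r1, c1 in boxes:
        inside.update((r, c) for r in range(r0, r1) for c in range(c0, c1))
    negatives = 0
    true_negatives = 0
    for r, row in enumerate(pred):
        for c, value in enumerate(row):
            if (r, c) not in inside:
                negatives += 1
                true_negatives += not value
    return true_negatives / negatives if negatives else float("nan")


# Toy 10x10 frame: one rectangle partly covered, one missed, one false positive.
pred = [[False] * 10 for _ in range(10)]
for r in range(2):
    for c in range(2):
        pred[r][c] = True
pred[9][9] = True
boxes = [(0, 0, 4, 4), (5, 5, 7, 7)]

print(annotation_sensitivity(pred, boxes))  # 0.8
print(pixel_specificity(pred, boxes))       # 0.9875
```

In this toy frame, the first rectangle (16 pixels) is detected because a quarter of it is covered, while the second (4 pixels) is missed, so the area-weighted sensitivity is 16/20. Mislabelled corner pixels inside a rectangle do not penalise the model, which is the motivation given for scoring sensitivity per annotation rather than per pixel.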
Results
In spite of the limited quality of the annotations, we find lesion detectors with good annotation-level sensitivity (Table 1) and good visual performance (see Figure 1). The detectors' performance is computed reliably: we evaluate sensitivity and specificity on 20 random subsets, each containing 10% of the images, and obtain similar performance for the same patient for all models (cf. Figure 2).

Conclusion
Despite mislabelled pixels, we obtain lesion detectors with good performance, and we show that the performance is computed reliably. However, the inter-patient performance is variable and the best models fail on some patients (sensitivity below 20% in some cases; Figure 2). This suggests that the models are not universal and that the appearance of bleeding and ulcers should be normalised further before automatic detection.
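The reliability check reported in the Results (20 random subsets, each with 10% of the images) could be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the per-image record layout `(detected_area, annotated_area, true_negatives, negatives)`, the function name, the fixed seed, and the choice to sample without replacement within each subset are all hypothetical.

```python
import random


def evaluate_on_subsets(records, n_subsets=20, fraction=0.1, seed=0):
    """Return one (sensitivity, specificity) pair per random subset of images.

    records : list of per-image tuples
              (detected_area, annotated_area, true_negatives, negatives);
              this layout is an assumption made for the sketch.
    """
    rng = random.Random(seed)
    k = max(1, round(fraction * len(records)))
    scores = []
    for _ in range(n_subsets):
        subset = rng.sample(records, k)  # without replacement inside a subset
        detected = sum(r[0] for r in subset)
        annotated = sum(r[1] for r in subset)
        true_neg = sum(r[2] for r in subset)
        negatives = sum(r[3] for r in subset)
        sensitivity = detected / annotated if annotated else float("nan")
        specificity = true_neg / negatives if negatives else float("nan")
        scores.append((sensitivity, specificity))
    return scores


# Toy usage: 50 images from one patient with made-up counts.
toy_rng = random.Random(1)
toy_records = [(toy_rng.randint(0, 10), 10, toy_rng.randint(80, 100), 100)
               for _ in range(50)]
scores = evaluate_on_subsets(toy_records)
spread = max(s for s, _ in scores) - min(s for s, _ in scores)
print(len(scores), spread)
```

A small spread across the subsets for a given patient is what would indicate that the performance estimate is stable; the counts in the toy usage are synthetic and carry no clinical meaning.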