Corrections to “Ensemble of Instance Segmentation Models for Polyp Segmentation in Colonoscopy Images”

Jaeyong Kang, Jeonghwan Gwak

doi:10.1109/access.2020.2995611

Abstract

In the above-named article, we explored an ensemble model using two Mask R-CNNs with ResNet50 and ResNet101 backbones, respectively, for polyp segmentation in colonoscopy images. The proposed ensemble method has two main stages. First, it generates proposals for regions of the input image that might contain an object. Second, it predicts the class of the object, refines the bounding box, and generates a pixel-level mask of the object based on the first-stage proposal. Both stages are connected to the backbone structure. The backbone is a Feature Pyramid Network (FPN) style deep neural network. It consists of a bottom-up pathway, a top-down pathway, and lateral connections. AlexNet, VGG, Inception-ResNet-v2, ResNet50, and ResNet101 are some of the popular backbone networks.
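The interplay of the bottom-up pathway, top-down pathway, and lateral connections can be illustrated with a toy sketch. This is not the implementation used in the article: it assumes single-channel feature maps stored as nested lists, uses nearest-neighbour upsampling, and lets the identity stand in for the learned lateral convolutions. The names upsample2x, top_down_merge, c2, c3, and c4 are hypothetical.

```python
# Toy FPN-style top-down merge on single-channel maps (nested lists).
# Assumption: real FPNs use multi-channel tensors with learned 1x1 and 3x3
# convolutions; both are omitted so that only the data flow remains.

def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D list."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                     # repeat each row
    return out


def top_down_merge(bottom_up):
    """Merge bottom-up maps (finest first) into pyramid levels.

    Starting from the coarsest map, each level is upsampled and added
    element-wise to the next finer bottom-up map (the lateral connection).
    """
    pyramid = [bottom_up[-1]]
    for lateral in reversed(bottom_up[:-1]):
        up = upsample2x(pyramid[0])
        merged = [[a + b for a, b in zip(r1, r2)]
                  for r1, r2 in zip(lateral, up)]
        pyramid.insert(0, merged)
    return pyramid  # finest level first


# Hypothetical bottom-up outputs at 4x4, 2x2, and 1x1 resolution.
c2 = [[1.0] * 4 for _ in range(4)]
c3 = [[2.0] * 2 for _ in range(2)]
c4 = [[3.0]]
p2, p3, p4 = top_down_merge([c2, c3, c4])
```

Each output level combines coarse, semantically strong features with finer spatial detail, which is what allows detection at multiple scales.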

Highlights

In this work, we devised the ensemble of two Mask R-CNNs using ResNet50 and ResNet101 as their backbones, respectively, and it was the best option compared to combinations of the other backbones under the given experimental parameter settings [1].
Reference [6], published on 24 June 2019, evaluated the performance of Mask R-CNN with three different backbone architectures (ResNet50, ResNet101, and Inception-ResNet-v2) for polyp segmentation.
Even in their article, the ensemble of ResNet50 and ResNet101 showed the best performance on all metrics except Precision, that is, on Recall, Dice, and Jaccard. It showed a performance improvement on each Mask R-CNN backbone network for 10, 20, and 30 epochs, which indicates that ResNet50 and ResNet101 outperformed Inception-ResNet-v2 in terms of accuracy.
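For reference, the four metrics named above can be computed from a predicted and a ground-truth binary mask as in the following minimal sketch. It assumes the standard pixel-wise definitions based on true positives, false positives, and false negatives. The evaluation protocol of the cited articles may differ in detail, and the function name and toy masks are hypothetical.

```python
def segmentation_metrics(pred, truth):
    """Return precision, recall, Dice, and Jaccard for two binary masks.

    Both masks are 2-D lists of 0/1 values with identical shape.
    """
    tp = fp = fn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1          # polyp pixel correctly predicted
            elif p and not t:
                fp += 1          # background predicted as polyp
            elif t and not p:
                fn += 1          # polyp pixel missed

    def ratio(num, den):
        # Assumption: return 0.0 when a metric is undefined (empty masks).
        return num / den if den else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "jaccard": ratio(tp, tp + fp + fn),
    }


# Toy masks: 2 true positives, 1 false positive, 1 false negative.
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
scores = segmentation_metrics(pred, truth)
```

On these toy masks, precision, recall, and Dice all equal 2/3, while Jaccard equals 1/2, which illustrates that Dice is never lower than Jaccard for the same prediction.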

Summary

Introduction

It is worth noting that Santos et al. [3] devised a Mask R-CNN with ResNet101 as its backbone architecture, in which a Feature Pyramid Network (FPN) was adopted to detect lesions at multiple scales by extracting features at multiple spatial resolutions. In this work, we devised the ensemble of two Mask R-CNNs using ResNet50 and ResNet101 as their backbones, respectively, and it was the best option compared to combinations of the other backbones under the given experimental (hyper)parameter settings [1].
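One simple way to combine the outputs of two segmentation models is sketched below. The text here does not state the combination rule used in the article, so the rule shown (averaging per-pixel probabilities and thresholding at 0.5) is an assumption for illustration only. The function name, the threshold, and the toy probability maps are hypothetical.

```python
def ensemble_masks(prob_a, prob_b, threshold=0.5):
    """Average two per-pixel probability maps and threshold the result.

    Assumption: both maps have the same shape and hold values in [0, 1].
    This is one possible ensembling rule, not necessarily the article's.
    """
    return [
        [1 if (a + b) / 2 >= threshold else 0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(prob_a, prob_b)
    ]


# Hypothetical mask probabilities from the two backbones for a 2x2 image.
probs_resnet50 = [[0.9, 0.4],
                  [0.2, 0.7]]
probs_resnet101 = [[0.8, 0.7],
                   [0.1, 0.2]]
mask = ensemble_masks(probs_resnet50, probs_resnet101)
```

Averaging lets a confident prediction from one model outweigh a weak one from the other, as in the top-right pixel, where the averaged probability crosses the threshold although one model alone would have rejected it.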

Results

Conclusion