Generalization of deep learning (DL) algorithms is critical for the safe implementation of computer-aided diagnosis systems in clinical practice. However, broad generalization remains a challenge in machine learning. This research aims to identify and study factors that can affect the internal validation and generalization of DL networks, namely the institution where the images were acquired, the image processing applied by the X-ray device, and the type of response function of the X-ray device. To this end, a pre-trained convolutional neural network (CNN, VGG16) was trained three times to classify COVID-19 and control chest radiographs with the same hyperparameters, but using different combinations of data acquired at two institutions with X-ray devices from three different manufacturers. Regarding internal validation, adding images from an external institution to the training set did not modify the algorithm’s internal performance; however, including images acquired by a device from a different manufacturer decreased performance by up to 8% (p < 0.05). In contrast, generalization across institutions and X-ray devices with the same type of response function was achieved. Nonetheless, generalization was not observed across devices with different types of response function. This factor was the key impediment to broad generalization in our research, followed by the device’s image processing and the inter-institutional differences, which reduced generalization performance by 18.9% (p < 0.05) and 9.8% (p < 0.05), respectively. Finally, a clustering analysis of features extracted by the CNN was performed, revealing a substantial dependence of the feature values extracted by the pre-trained CNN on the X-ray device that acquired the images.
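The clustering step described above can be sketched as follows. This is a minimal illustration with synthetic data, not the paper's actual features or pipeline: it assumes feature vectors have already been extracted by the CNN, simulates a device-dependent offset in those features, and runs a plain NumPy k-means to show how clusters can align with the acquiring device rather than with the diagnostic class.

```python
import numpy as np

# Hypothetical illustration: CNN feature vectors whose values depend on
# the acquiring X-ray device. Two synthetic "devices" produce features
# with different offsets; the offset of 3.0 is an arbitrary assumption.
rng = np.random.default_rng(0)
n, dim = 100, 8
device = np.repeat([0, 1], n // 2)                 # device label per image
features = rng.normal(size=(n, dim)) + device[:, None] * 3.0


def kmeans(x, k=2, iters=50):
    """Plain NumPy k-means (Lloyd's algorithm) for the two-cluster sketch.

    Initializes one center at the first point and the other at the point
    farthest from it, which is sufficient for this well-separated example.
    """
    far = np.linalg.norm(x - x[0], axis=1).argmax()
    centers = np.stack([x[0], x[far]])
    for _ in range(iters):
        # Distance of every point to every center, then nearest-center labels.
        d = np.linalg.norm(x[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels


labels = kmeans(features)
# If the device-dependent offset dominates, cluster assignment agrees
# (up to label permutation) with the device that produced the image.
agreement = max((labels == device).mean(), (labels != device).mean())
print(f"cluster/device agreement: {agreement:.2f}")
```

With a device offset this strong relative to the noise, the cluster/device agreement is close to 1.0, mirroring the abstract's finding that the extracted feature values depend substantially on the X-ray device.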