Explainable Knowledge Distillation for On-Device Chest X-Ray Classification.
This study introduces a knowledge distillation approach to develop compact, efficient deep learning models for multi-label chest X-ray classification suitable for limited hardware, achieving improved AUC scores (up to 88.7%) with fewer parameters and low computational cost, enhanced by explainable AI visualizations.
Automated multi-label chest X-ray (CXR) image classification has achieved substantial progress in clinical diagnosis by utilizing sophisticated deep learning approaches. However, most deep models have high computational demands, which makes them less feasible for compact devices with limited computational resources. To overcome this problem, we propose a knowledge distillation (KD) strategy to create a compact deep learning model for real-time multi-label CXR image classification. We study different CNNs and Transformers as the teacher to distill knowledge into a smaller student. We then employ explainable artificial intelligence (XAI) to provide a visual explanation for the model decisions improved by KD. Our results on three benchmark CXR datasets show that our KD strategy improves the performance of the compact student model, making it a feasible choice for many limited hardware platforms. For instance, when using DenseNet161 as the teacher network, EEEA-Net-C2 achieved an AUC of 83.7%, 87.1%, and 88.7% on the ChestX-ray14, CheXpert, and PadChest datasets, respectively, with only 4.7 million parameters and a computational cost of 0.3 billion FLOPs.
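The abstract does not give the form of the distillation objective, so the following is only a minimal sketch of one common way to distill a multi-label classifier: each label gets an independent sigmoid, and the student is trained against both the ground-truth labels and the teacher's temperature-softened probabilities. The function names, `alpha`, and `temperature` are illustrative assumptions, not values from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce(p, t, eps=1e-12):
    """Binary cross-entropy between a predicted probability p and a target t in [0, 1]."""
    p = min(max(p, eps), 1.0 - eps)
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))

def multilabel_kd_loss(student_logits, teacher_logits, labels,
                       alpha=0.5, temperature=2.0):
    """Hypothetical multi-label KD loss (alpha and temperature are assumed values).

    hard term: BCE between student probabilities and ground-truth labels
    soft term: BCE between temperature-softened student and teacher probabilities
    """
    hard = sum(bce(sigmoid(s), y) for s, y in zip(student_logits, labels))
    soft = sum(
        bce(sigmoid(s / temperature), sigmoid(t / temperature))
        for s, t in zip(student_logits, teacher_logits)
    )
    return (alpha * hard + (1.0 - alpha) * soft) / len(labels)

# Three findings for one image: present, absent, present.
labels = [1, 0, 1]
teacher = [3.0, -3.0, 2.0]
agreeing_student = [2.5, -2.5, 1.5]
disagreeing_student = [-2.5, 2.5, -1.5]
print(multilabel_kd_loss(agreeing_student, teacher, labels))
print(multilabel_kd_loss(disagreeing_student, teacher, labels))
```

A student that agrees with both the labels and the teacher receives the lower loss; the soft term does not reach zero even at perfect agreement, because cross-entropy against a soft target includes the target's own entropy.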
- Research Article
4
- 10.1038/s41598-024-79249-7
- Nov 15, 2024
- Scientific Reports
Given a set of labels, multi-label text classification (MLTC) aims to assign multiple relevant labels to a text. Recently, deep learning models have achieved promising results in MLTC. Training a high-quality deep MLTC model typically demands large-scale labeled data, and compared with annotating single-label samples, annotating multi-label samples is typically more time-consuming and expensive. Active learning can enable a classification model to achieve optimal prediction performance using fewer labeled samples. Although active learning has been considered for deep learning models, there are few studies on active learning for deep multi-label classification models. In this work, for the deep MLTC model, we propose a deep Active Learning method based on Bayesian deep learning and Expected confidence (BEAL). It adopts Bayesian deep learning to derive the deep model’s posterior predictive distribution and defines a new expected confidence-based acquisition function to select uncertain samples for deep MLTC model training. Moreover, we perform experiments with a BERT-based MLTC model, since BERT can achieve satisfactory performance through fine-tuning on various classification tasks. The results on benchmark datasets demonstrate that BEAL enables more efficient model training, allowing the deep model to reach training convergence with fewer labeled samples.
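The abstract does not define BEAL's acquisition function, so the sketch below is a generic stand-in rather than the authors' formula. It assumes the posterior predictive distribution is approximated by several stochastic forward passes (for example, with dropout left on), scores each unlabeled sample by its average per-label confidence `max(p, 1 - p)`, and sends the least confident samples for annotation. All names are hypothetical.

```python
def expected_confidence(mc_probs):
    """Average confidence over stochastic forward passes.

    mc_probs: list of passes, each a list of per-label probabilities.
    Per-label confidence is max(p, 1 - p); this definition is an assumption.
    """
    total = 0.0
    for probs in mc_probs:
        total += sum(max(p, 1.0 - p) for p in probs) / len(probs)
    return total / len(mc_probs)

def select_uncertain(pool, k):
    """Return the k sample ids with the lowest expected confidence."""
    ranked = sorted(pool, key=lambda sid: expected_confidence(pool[sid]))
    return ranked[:k]

# Two stochastic passes over three unlabeled texts with two labels each.
pool = {
    "a": [[0.90, 0.10], [0.95, 0.05]],
    "b": [[0.50, 0.60], [0.40, 0.55]],
    "c": [[0.80, 0.30], [0.70, 0.20]],
}
print(select_uncertain(pool, 2))
```

In this toy pool, sample "b" has probabilities closest to 0.5 across passes and is therefore ranked first for labeling.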
- Research Article
21
- 10.1016/j.jksuci.2023.101616
- Jun 14, 2023
- Journal of King Saud University - Computer and Information Sciences
Recently, deep neural networks (DNNs) have been used successfully in many fields, particularly in medical diagnosis. However, deep learning (DL) models are expensive in terms of memory and computing resources, which hinders their implementation on resource-limited devices or in delay-sensitive systems. Therefore, these deep models need to be accelerated and compressed to smaller sizes to be deployed on edge devices without noticeably affecting their performance. In this paper, recent acceleration and compression approaches for DNNs are analyzed and compared regarding their performance, applications, benefits, and limitations, with a greater focus on knowledge distillation as a successful emergent approach in this field. In addition, a framework is proposed to develop knowledge distilled DNN models that can be deployed on fog/edge devices for automatic disease diagnosis. To evaluate the proposed framework, two compressed medical diagnosis systems are proposed based on knowledge distillation deep neural models for both COVID-19 and Malaria. The experimental results show that these knowledge distilled models have been compressed by 18.4% and 15% of the original model and their responses accelerated by 6.14x and 5.86%, respectively, while there was no significant drop in their performance (dropped by 0.9% and 1.2%, respectively). Furthermore, the distilled models are compared with other pruned and quantized models. The obtained results revealed the superiority of the distilled models in terms of compression rates and response time.
- Research Article
3
- 10.1109/jbhi.2024.3365051
- May 1, 2024
- IEEE journal of biomedical and health informatics
When decoding neuroelectrophysiological signals represented by Magnetoencephalography (MEG), deep learning models generally achieve high predictive performance but lack the ability to interpret their predicted results. This limitation prevents them from meeting the essential requirements of reliability and ethical-legal considerations in practical applications. In contrast, intrinsically interpretable models, such as decision trees, possess self-evident interpretability while typically sacrificing accuracy. To effectively combine the respective advantages of both deep learning and intrinsically interpretable models, an MEG transfer approach through feature attribution-based knowledge distillation is pioneered, which transforms deep models (teacher) into highly accurate intrinsically interpretable models (student). The resulting models provide not only intrinsic interpretability but also high predictive performance, besides serving as an excellent approximate proxy to understand the inner workings of deep models. In the proposed approach, post-hoc feature knowledge derived from post-hoc interpretable algorithms, specifically feature attribution maps, is introduced into knowledge distillation for the first time. By guiding intrinsically interpretable models to assimilate this knowledge, the transfer of MEG decoding information from deep models to intrinsically interpretable models is implemented. Experimental results demonstrate that the proposed approach outperforms the benchmark knowledge distillation algorithms. This approach successfully improves the prediction accuracy of Soft Decision Tree by a maximum of 8.28%, reaching almost equivalent or even superior performance to deep teacher models. Furthermore, the model-agnostic nature of this approach offers broad application potential.
- Dissertation
- 10.17760/d20659759
- Jan 1, 2024
Artificial intelligence (AI), empowered by deep learning, has been profoundly transforming the world. However, the excessive size of these models remains a central obstacle that limits their broader utility. Modern neural networks commonly consist of millions of parameters, with foundation models extending to billions. The rapid expansion in model size introduces many challenges, including training cost, sluggish inference speed, excessive energy consumption, and negative environmental implications such as increased CO2 emissions. Addressing these challenges necessitates the adoption of efficient deep learning (EDL). The dissertation focuses on two overarching approaches, network sparsity (a.k.a. pruning) and knowledge distillation, to enhance the efficiency of deep learning models in the context of computer vision. Network pruning focuses on eliminating redundant parameters in a model while preserving the performance. Knowledge distillation aims to enhance the performance of the target model, referred to as the "student", by leveraging guidance from a stronger model, known as the "teacher". This approach leads to performance improvements in the target model without reducing its size. In this dissertation, I will start with the background and motivation for more efficient deep learning models over the past several years in the context of the emerging foundation models. Then, the basic concepts, goals, and challenges of EDL will be introduced along with the major sub-methods. After that, the major part of this dissertation will be dedicated to elaborating on the proposed efficiency algorithms based on pruning and distillation in a variety of applications. For the pruning part, the dissertation first presents an effective pruning algorithm, GReg [27], for image classification, which taps into a growing regularization strategy.
Then, in order to understand the real progress of network pruning, a fairness principle is introduced to fairly compare different pruning methods [32]. The investigation leads us to the central role of network trainability in pruning, which has been largely overlooked by prior works. A trainability-preserving pruning approach, TPP [28], is then proposed to show the merits of maintaining trainability during pruning. A short survey [33] on an emerging pruning paradigm, pruning at initialization, is then presented to discuss its potential and its connections with conventional pruning after training. The GReg algorithm is further extended to a low-level vision task, single image super-resolution (SR), to explore the difference between utilizing pruning in low-level vision (SR) and in high-level vision (image classification). Three efficient SR approaches (ASSL [29], GASSL [30], SRP [34]) are introduced. For the distillation part, the dissertation first focuses on the interaction between knowledge distillation and data augmentation in image classification [35], with a proven proposition presented to rigorously understand what defines the "goodness" of a data augmentation scheme in distillation. Next, the dissertation showcases how to employ distillation to significantly improve the inference efficiency of novel view synthesis in 3D vision. Both static scenes [31] and dynamic scenes [36] are considered. SnapFusion [37] is then presented to demonstrate a systematic efficiency optimization of deep models by jointly utilizing pruning and distillation, towards an unprecedentedly fast speed of text-to-image generation based on diffusion models. Finally, a comprehensive summary, along with takeaways and an outlook on future work, will conclude the dissertation.
Major takeaways include (1) there is no panacea for efficient deep learning across all tasks; the solution is usually case-by-case; (2) there is a clear trend that the efficiency solution for future models (especially large models) will feature systematic optimization and co-design along many axes (e.g., hardware, system, and algorithm); (3) profiling is always a good start to understand the problem so as to build the right efficiency portfolio.--Author's abstract
- Research Article
2
- 10.1038/s41598-025-19827-5
- Sep 24, 2025
- Scientific reports
The classification and identification of forest tree species is of great value in the study of species diversity and forest monitoring. With the development of emerging technologies, the combination of remote sensing images and deep learning methods has become an important means to study multi-label image classification. However, nowadays, due to the small difference between tree species images, the difficulty of artificial labeling, and the difficulty of obtaining data sets, there are few studies on multi-label classification for tree species images. Therefore, taking the TreeSatAI dataset as an example, a multi-branch and multi-label image classification model (MMTSC) specifically designed for multi-source remote sensing data is proposed to classify and identify 15 tree species in the dataset. In a complex forest stand scenario with unbalanced data, our F1-Score and Precision are as high as about 72% and 82%, respectively. The visualization results of the confusion matrix and Grad-CAM heat map further verify the model's recognition ability on different categories. To comprehensively evaluate the model performance, we compared it with other state-of-the-art (SOTA) methods for multi-label image classification tasks and conducted a series of ablation experiments. Experimental results show that the MMTSC model outperforms other SOTA methods in F1-Score, Precision, Recall, and mAP. In addition, we also compared the model's backbone network DenseNet121 with the classic structures of EfficientNet-B0, ConvNeXt-Tiny, ResNet-18, MobileNetV3 and RegNetX-800MF. The evaluation results showed that the DenseNet121 architecture performed best in this task, verifying its effectiveness and adaptability as a backbone network. 
Finally, we use the results of the deep learning-based multi-label tree species classification model for biomass estimation, providing practical suggestions for relevant institutions, thereby contributing to the scientific management of forest resources and the improvement of carbon sequestration capacity.
- Conference Article
3
- 10.1145/3394171.3413985
- Oct 12, 2020
Learning from music to visual storytelling of shots is an interesting and emerging task. It produces a coherent visual story in the form of a shot type sequence, which not only expands the storytelling potential of a song but also facilitates automatic concert video mashup and storyboard generation. In this study, we present a deep interactive learning (DIL) mechanism for building a compact yet accurate sequence-to-sequence model to accomplish the task. Unlike the one-way transfer between a pre-trained teacher network (or ensemble network) and a student network in knowledge distillation (KD), the proposed method enables collaborative learning between an ensemble teacher network and a student network. Namely, the student network also teaches. Specifically, our method first learns a teacher network that is composed of several assistant networks to generate a shot type sequence and produce the soft target (shot types) distribution accordingly through KD. It then constructs the student network, which learns from both the ground truth label (hard target) and the soft target distribution to alleviate the difficulty of optimization and improve generalization capability. As the student network gradually advances, it in turn feeds knowledge back to the assistant networks, thereby improving the teacher network in each iteration. Owing to such interactive designs, the DIL mechanism bridges the gap between the teacher and student networks and yields superior capability for both networks. Objective and subjective experimental results demonstrate that both the teacher and student networks can generate more attractive shot sequences from music, thereby enhancing the viewing and listening experience.
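As a rough illustration of a student learning from both a hard target and a soft target distribution, the sketch below implements the widely used temperature-scaled distillation loss for a single prediction: cross-entropy on the ground-truth class plus a KL term between softened teacher and student distributions. It does not model the DIL feedback from student to teacher, and `alpha` and `temperature` are assumed values, not settings from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, true_class,
            alpha=0.5, temperature=4.0):
    """alpha * CE(hard target) + (1 - alpha) * T^2 * KL(teacher || student)."""
    hard = -math.log(softmax(student_logits)[true_class])
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s)
               for t, s in zip(q_teacher, q_student) if t > 0)
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft

# Three hypothetical shot types; the ground truth is class 0.
teacher = [5.0, 1.0, 0.0]
print(kd_loss([4.0, 1.0, 0.0], teacher, true_class=0))
print(kd_loss([0.0, 1.0, 4.0], teacher, true_class=0))
```

The `temperature ** 2` factor is the usual scaling that keeps the soft-term gradients comparable in magnitude to the hard-term gradients as the temperature changes.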
- Research Article
32
- 10.1109/tip.2021.3101158
- Jan 1, 2021
- IEEE Transactions on Image Processing
Minimizing computational complexity is essential for the adoption of deep networks in practical applications. Currently, most research attempts to accelerate deep networks by designing new network structures or compressing the network parameters. Meanwhile, transfer learning techniques such as knowledge distillation are utilized to preserve the performance of deep models. In this paper, we focus on accelerating deep models and relieving the computational burden by using low-resolution (LR) images as inputs while maintaining competitive performance, which is rarely studied in the current literature. Deep networks may encounter serious performance degradation when using LR inputs because many details are unavailable in LR images. Besides, existing approaches may fail to learn discriminative features for LR images because of the dramatic appearance variations between LR and high-resolution (HR) images. To tackle the above problems, we propose a resolution-aware knowledge distillation (RKD) framework to narrow the cross-resolution variations by transferring knowledge from the HR domain to the LR domain. The proposed framework consists of an HR teacher network and an LR student network. First, we introduce a discriminator and propose an adversarial learning strategy to shrink the variations between inputs of differing resolution. Then we design a cross-resolution knowledge distillation (CRKD) loss to train a discriminative student network by exploiting the knowledge of the teacher network. The CRKD loss consists of a resolution-aware distillation loss, a pair-wise constraint, and a maximum mean discrepancy loss. Experimental results on person re-identification, image classification, face recognition, and defect segmentation tasks demonstrate that RKD outperforms the traditional knowledge distillation method by achieving better performance with lower computational complexity.
Furthermore, CRKD surpasses the state-of-the-art knowledge distillation methods in transferring knowledge across different resolutions under the RKD framework, especially when coping with large resolution differences.
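Of the three CRKD terms named above, the maximum mean discrepancy (MMD) is the easiest to show in isolation. The sketch below computes the biased squared-MMD estimate between two sets of feature vectors with a Gaussian kernel. The kernel choice, the bandwidth `sigma`, and the toy "HR" and "LR" features are assumptions for illustration; the abstract does not say which kernel or features the paper uses.

```python
import math

def rbf(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    def mean_kernel(a, b):
        return sum(rbf(u, v, sigma) for u in a for v in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Hypothetical teacher (HR) and student (LR) feature batches.
hr_features = [[0.0, 0.0], [1.0, 0.0]]
lr_far = [[5.0, 5.0], [6.0, 5.0]]
lr_near = [[0.1, 0.0], [0.9, 0.1]]
print(mmd2(hr_features, lr_far))
print(mmd2(hr_features, lr_near))
```

Minimizing such a term pulls the distribution of student features toward that of the teacher features; identical batches give a value of zero.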
- Research Article
19
- 10.3390/e23020204
- Feb 7, 2021
- Entropy (Basel, Switzerland)
Coronavirus disease 2019 (COVID-19) has become a threat to the world. Computed tomography (CT) is an informative tool for the diagnosis of COVID-19 patients. Many deep learning approaches for CT images have been proposed and have shown promising performance. However, due to the high complexity and non-transparency of deep models, explaining the diagnosis process is challenging, making it hard to evaluate whether such approaches are reliable. In this paper, we propose a visual interpretation architecture for explaining deep learning models and apply the architecture to COVID-19 diagnosis. Our architecture provides a comprehensive interpretation of the deep model from different perspectives, including the training trends, diagnostic performance, learned features, feature extractors, the hidden layers, the support regions for diagnostic decisions, etc. With the interpretation architecture, researchers can compare and explain the classification performance, gain insight into what the deep model learned from images, and obtain support for diagnostic decisions. Our deep model achieves 94.75%, 93.22%, 96.69%, 97.27%, and 91.88% in accuracy, sensitivity, specificity, positive predictive value, and negative predictive value, respectively, which are 8.30%, 4.32%, 13.33%, 10.25%, and 6.19% higher than those of the compared traditional methods. The visualized features in 2-D and 3-D spaces provide the reasons for the superiority of our deep model. Our interpretation architecture would allow researchers to understand more about how and why deep models work, and can be used as an interpretation solution for any deep learning model based on convolutional neural networks. It can also help deep learning methods take a step forward in the clinical COVID-19 diagnosis field.
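The five criteria reported above all follow from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the counts are made-up illustrative numbers, not data from the paper, and the function assumes every denominator is non-zero.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard diagnostic metrics from confusion-matrix counts.

    Assumes each denominator is non-zero (at least one case in every margin).
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value (precision)
        "npv": tn / (tn + fn),           # negative predictive value
    }

# Hypothetical counts: 100 diseased and 100 healthy scans.
for name, value in diagnostic_metrics(tp=90, fp=20, tn=80, fn=10).items():
    print(f"{name}: {value:.4f}")
```

Sensitivity and specificity condition on the true state of the patient, whereas PPV and NPV condition on the model's prediction, which is why the latter two shift with disease prevalence in the test set.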
- Conference Article
78
- 10.1145/2072298.2072344
- Nov 28, 2011
19th ACM International Conference on Multimedia (ACM Multimedia 2011, MM'11), Scottsdale, AZ, 28 November to 1 December 2011
- Research Article
11
- 10.1007/s11277-023-10336-0
- Jan 1, 2023
- Wireless Personal Communications
COVID-19 is an epidemic disease that threatened people worldwide and eventually became a pandemic. Differentiating COVID-19-affected patients from healthy populations is a crucial task. Technology-enabled solutions are needed, and this paper proposes a deep learning model for detecting COVID-19 from chest X-ray (CXR) images. In this research work, we provide insights on how to build robust deep learning models that distinguish COVID-19 CXR images from normal and pneumonia-affected CXR images. We contribute a methodical guide to preparing data for a robust deep learning model. The datasets were prepared by restructuring and combining images from several datasets to improve training of the deep model. These recently published datasets enable us to build our own model and compare it with pre-trained models. The experiments show that the proposed approach classifies COVID-19 patients effectively from CXR images. The empirical work, which uses a deep neural network with three convolutional layers called “DeepCOVNet” to classify CXR images into three classes (COVID-19, Normal, and Pneumonia), yielded an accuracy of 96.77% and an F1-score of 0.96 on two different combinations of datasets.
- Book Chapter
2
- 10.1007/978-981-19-7402-1_40
- Jan 1, 2023
Recently, the number of medical X-ray images being generated has been increasing rapidly due to advancements in radiological equipment in medical centres. Medical X-ray image classification techniques are needed for effective decision making in the healthcare sector. Since traditional image classification models fall short of the best achievable X-ray image classification performance, deep learning (DL) models have emerged. In this study, an Arithmetic Optimization Algorithm with Deep Learning-Based Medical X-Ray Image Classification (AOADL-MXIC) model has been developed. The proposed AOADL-MXIC model examines the available X-ray images to identify diseases. Initially, the AOADL-MXIC model executes the pre-processing step using the Gabor filtering (GF) technique to remove noise. Next, the Capsule Network (CapsNet) model is utilized to derive feature vectors from the input X-ray images. Furthermore, the AOA is used to optimize the hyperparameters of the CapsNet approach. Finally, the bidirectional gated recurrent unit (BiGRU) model is employed for the classification of medical X-ray images. The experimental analysis of the AOADL-MXIC technique on a set of medical images showed promising performance compared with the other models.
Keywords: X-ray images; Arithmetic optimization algorithm; Deep learning; Feature extraction; Hyperparameter tuning
- Research Article
8
- 10.3390/computers13080184
- Jul 24, 2024
- Computers
As the demand for efficient and lightweight models in image classification grows, knowledge distillation has emerged as a promising technique to transfer expertise from complex teacher models to simpler student models. However, the efficacy of knowledge distillation is intricately linked to the choice of datasets used during training. Datasets are pivotal in shaping a model’s learning process, influencing its ability to generalize and discriminate between diverse patterns. While considerable research has independently explored knowledge distillation and image classification, a comprehensive understanding of how different datasets impact knowledge distillation remains a critical gap. This study systematically investigates the impact of diverse datasets on knowledge distillation in image classification. By varying dataset characteristics such as size, domain specificity, and inherent biases, we aim to unravel the nuanced relationship between datasets and the efficacy of knowledge transfer. Our experiments employ a range of datasets to comprehensively explore their impact on the performance gains achieved through knowledge distillation. This study contributes valuable guidance for researchers and practitioners seeking to optimize image classification models through knowledge distillation. By elucidating the intricate interplay between dataset characteristics and knowledge distillation outcomes, our findings empower the community to make informed decisions when selecting datasets, ultimately advancing the field toward more robust and efficient model development.
- Research Article
8
- 10.1038/s41598-024-69813-6
- Aug 14, 2024
- Scientific Reports
This paper presents a Cosine Similarity-Based Knowledge Distillation (CSKD) for robust, lightweight object detectors. Knowledge Distillation (KD) has been effective in enhancing the performance of compact models in image classification by leveraging deep CNN models. However, the complex and multifaceted nature of object detection, characterized by its modular design and multitasking requirements, poses significant challenges for traditional KD techniques. These challenges are further compounded by the conventional reliance on the Mean Squared Error (MSE) loss function and the limited application of enhanced feature representations to the training phase. Addressing these limitations, the proposed CSKD method combines cosine similarity guidance with MSE loss to facilitate a more effective knowledge transfer from the teacher model to the student model. This is achieved by distilling both intermediate features and prediction outputs, aided by an assistant prediction branch designed to learn directly from the teacher’s predictions. This dual-faceted distillation strategy enables the student model to better mimic the teacher model’s behavior, leading to improved performance. The proposed method demonstrates versatility and robustness across various object detector architectures without the need for additional feature enhancement layers during training. Notably, employing ResNet-50 as the teacher model and ResNet-18 as the student model, we achieve new benchmarks in KD for object detection across several architectures, including Faster-RCNN, RetinaNet, FCOS, and GFL, with respective mAP scores of 36.6, 35.2, 35.9, and 38.9. These results highlight the effectiveness of CSKD in advancing the state-of-the-art in KD for object detection, offering a compelling solution to the challenges previously faced by traditional KD methods in this domain. The code of the proposed CSKD is available at https://github.com/swkdn16/CSKD.
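To make the idea of combining cosine similarity guidance with MSE concrete, here is a minimal sketch of a feature-distillation loss on one flattened feature vector. It is not the CSKD implementation (see the authors' repository for that); the additive combination and the weight `cos_weight` are assumptions chosen for illustration.

```python
import math

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cosine_similarity(a, b, eps=1e-12):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)  # eps guards against zero vectors

def cosine_mse_loss(student_feat, teacher_feat, cos_weight=1.0):
    """Hypothetical combined loss: MSE plus a (1 - cosine similarity) penalty."""
    return mse(student_feat, teacher_feat) + cos_weight * (
        1.0 - cosine_similarity(student_feat, teacher_feat)
    )

teacher = [1.0, 2.0, 3.0]
print(cosine_mse_loss([1.0, 2.0, 3.0], teacher))   # same direction and magnitude
print(cosine_mse_loss([2.0, 4.0, 6.0], teacher))   # same direction, larger magnitude
print(cosine_mse_loss([3.0, -1.0, 0.5], teacher))  # different direction
```

The two terms penalize different things: MSE is sensitive to feature magnitude, while the cosine term depends only on direction, so a student whose features are a scaled copy of the teacher's pays only the MSE part.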
- Conference Article
7
- 10.1109/icaice54393.2021.00127
- Nov 1, 2021
Model fusion can effectively improve prediction quality, but at the cost of increased running time. In this paper, dual-stage progressive knowledge distillation is improved by combining it with multi-teacher knowledge distillation. A simple and effective method for integrating the soft targets of multiple teachers is proposed for multi-teacher knowledge distillation, strengthening the guiding role of the better-performing models. Dual-stage progressive knowledge distillation is a method for distillation with few samples; it uses progressive network grafting to realize knowledge distillation in a small-sample environment. In the first step, the student blocks are grafted one by one onto the teacher network and intertwined with other teacher blocks for training, and the training process only updates the parameters of the grafted blocks. In the second step, the trained student blocks are grafted onto the teacher network in turn, so that the learned student blocks adapt to each other and finally replace the teacher network to obtain a lighter network structure. Using the soft targets obtained by this method, instead of hard targets, for training in dual-stage progressive knowledge distillation yielded excellent results on the BreakHis dataset.
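The abstract does not spell out the integration rule, so the following is only one plausible sketch: the teachers' class distributions are averaged with weights proportional to a quality score (for example, validation accuracy), so that stronger teachers contribute more to the student's soft target. The function name, the scoring scheme, and the numbers are hypothetical.

```python
def integrate_soft_targets(teacher_probs, teacher_scores):
    """Weighted average of teacher distributions.

    teacher_probs: one probability vector per teacher (each sums to 1).
    teacher_scores: non-negative quality scores; normalizing them to weights
    is an assumption, not the paper's stated rule.
    """
    total = sum(teacher_scores)
    weights = [s / total for s in teacher_scores]
    n_classes = len(teacher_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, teacher_probs))
        for c in range(n_classes)
    ]

# Two teachers on a two-class (e.g. benign / malignant) problem.
probs = [[0.8, 0.2], [0.4, 0.6]]
print(integrate_soft_targets(probs, teacher_scores=[3.0, 1.0]))
print(integrate_soft_targets(probs, teacher_scores=[1.0, 1.0]))
```

Because the weights sum to one and each input is a probability distribution, the integrated soft target is itself a valid distribution that can be fed into a standard distillation loss.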
- Research Article
15
- 10.1016/j.eswa.2024.123892
- Apr 6, 2024
- Expert Systems With Applications
Coordinate Attention Guided Dual-Teacher Adaptive Knowledge Distillation for image classification