Knowledge distillation application technology for Chinese NLP

Abstract

Popular deep neural network models often suffer from high latency, difficult deployment, and high hardware requirements in practical applications. Knowledge distillation is an effective way to address these problems. We adopted an innovative knowledge distillation approach and formulated task-specific data augmentation strategies, obtaining a lightweight model with a 6.7x acceleration ratio and a 13.6x compression ratio relative to the baseline model BERT-base; the average performance of the lightweight model reached 95% of BERT-base for each task. We then investigated in depth some of the issues that remain in the knowledge distillation phase. To address the problems in distillation model selection and model fine-tuning, we propose a teacher model and student model selection strategy and a two-stage model fine-tuning strategy applied before and after the knowledge distillation stage. These two strategies further improve the average performance of the models to 98% of BERT-base. Finally, we developed a lightweight model evaluation scheme based on different types of downstream tasks, which provides a reference for subsequent practical applications that encounter similar tasks.
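
To give a sense of scale, the reported ratios can be turned into rough absolute numbers. The sketch below assumes the commonly cited figure of about 110 million parameters for BERT-base and assumes the compression ratio is measured in parameter count; the abstract states neither, and the 100 ms baseline latency is a hypothetical placeholder.

```python
# Rough arithmetic on the ratios reported in the abstract.
# Assumptions (not from the abstract): BERT-base has ~110M parameters,
# the compression ratio is measured in parameters, and the 100 ms
# baseline latency is a made-up placeholder.

BERT_BASE_PARAMS = 110_000_000
COMPRESSION_RATIO = 13.6   # reported in the abstract
ACCELERATION_RATIO = 6.7   # reported in the abstract


def compressed_params(baseline_params: int, ratio: float) -> float:
    """Parameter count implied by a given compression ratio."""
    return baseline_params / ratio


def accelerated_latency(baseline_ms: float, ratio: float) -> float:
    """Latency implied by a given acceleration ratio."""
    return baseline_ms / ratio


student_params = compressed_params(BERT_BASE_PARAMS, COMPRESSION_RATIO)
student_latency_ms = accelerated_latency(100.0, ACCELERATION_RATIO)

print(f"student parameters: about {student_params / 1e6:.1f}M")
print(f"latency for a 100 ms baseline: about {student_latency_ms:.1f} ms")
```

Under these assumptions the student would hold roughly 8 million parameters, which is only an order-of-magnitude illustration and not a figure reported by the authors.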

Similar Papers
  • Research Article
  • Cite Count Icon 11
  • 10.70917/2025014
Lightweight Deep Learning Models For Edge Devices—A Survey
  • Jan 6, 2025
  • International Journal of Computer Information Systems and Industrial Management Applications
  • Aminu Musa + 5 more

As edge computing gains attention across various domains, the demand for lightweight deep learning models capable of running efficiently on resource-constrained edge devices has surged. This survey investigates the landscape of lightweight deep learning models tailored for edge computing environments. The survey explores various model compression techniques used to design and optimize deep learning models for edge deployment, including model quantization, pruning, and knowledge distillation. Emphasis is placed on strategies to reduce model size, computational complexity, and memory footprint while maintaining satisfactory performance levels. Additionally, the study examines the performance of these techniques on three real-life datasets for evaluating lightweight deep learning models, highlighting the importance of balanced datasets representative of edge device deployment scenarios. Furthermore, this survey provides a comprehensive overview of the current state of lightweight deep learning models for edge devices, offering insights into design considerations, optimization techniques, and performance evaluation methodologies. The findings show that most of the compression techniques suffer from performance degradation, proving the existence of a trade-off between compression and performance. Therefore, we proposed a hybrid lossless compressed model that combines pruning, quantization, and knowledge distillation to reduce parameters and weights, resulting in a lightweight model. The proposed model is three times smaller than the vanilla CNN model and achieved a state-of-the-art accuracy of 97% after compression, which shows the effectiveness of our approach. These results will serve as a valuable resource for researchers and practitioners aiming to develop efficient and scalable deep learning solutions for edge computing applications.

  • Research Article
  • Cite Count Icon 6
  • 10.70917/ijcisim-2025-0014
Lightweight Deep Learning Models For Edge Devices—A Survey
  • Jan 6, 2025
  • International Journal of Computer Information Systems and Industrial Management Applications
  • Aminu Musa + 5 more

As edge computing gains attention across various domains, the demand for lightweight deep learning models capable of running efficiently on resource-constrained edge devices has surged. This survey investigates the landscape of lightweight deep learning models tailored for edge computing environments. The survey explores various model compression techniques used to design and optimize deep learning models for edge deployment, including model quantization, pruning, and knowledge distillation. Emphasis is placed on strategies to reduce model size, computational complexity, and memory footprint while maintaining satisfactory performance levels. Additionally, the study examines the performance of these techniques on three real-life datasets for evaluating lightweight deep learning models, highlighting the importance of balanced datasets representative of edge device deployment scenarios. Furthermore, this survey provides a comprehensive overview of the current state of lightweight deep learning models for edge devices, offering insights into design considerations, optimization techniques, and performance evaluation methodologies. The findings show that most of the compression techniques suffer from performance degradation, proving the existence of a trade-off between compression and performance. Therefore, we proposed a hybrid lossless compressed model that combines pruning, quantization, and knowledge distillation to reduce parameters and weights, resulting in a lightweight model. The proposed model is three times smaller than the vanilla CNN model and achieved a state-of-the-art accuracy of 97% after compression, which shows the effectiveness of our approach. These results will serve as a valuable resource for researchers and practitioners aiming to develop efficient and scalable deep learning solutions for edge computing applications.

  • Research Article
  • Cite Count Icon 3
  • 10.1109/embc40787.2023.10340704
Cutting Weights of Deep Learning Models for Heart Sound Classification: Introducing a Knowledge Distillation Approach.
  • Jul 24, 2023
  • Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference
  • Zikai Song + 7 more

Cardiovascular diseases (CVDs) are the number one cause of death worldwide. In recent years, intelligent auxiliary diagnosis of CVDs based on computer audition has become a popular research field, and intelligent diagnosis technology is increasingly mature. Neural networks used to monitor CVDs are becoming more complex, requiring more computing power and memory, and are difficult to deploy in wearable devices. This paper proposes a lightweight model for classifying heart sounds based on knowledge distillation, which can be deployed in wearable devices to monitor the heart sounds of wearers. The network model is designed based on Convolutional Neural Networks (CNNs). Model performance is evaluated by extracting Mel Frequency Cepstral Coefficients (MFCCs) features from the PhysioNet/CinC Challenge 2016 dataset. The experimental results show that knowledge distillation can improve a lightweight network's accuracy, and our model performs well on the test set. In particular, when the knowledge distillation temperature is 7 and the weight α is 0.1, the accuracy is 88.5%, the recall is 83.8%, and the specificity is 93.6%. Clinical relevance: A lightweight heart sound classification model based on knowledge distillation can be deployed on various hardware devices to monitor the physical condition of patients with CVDs and provide timely feedback and medical advice. When the model is deployed on hospital medical instruments, the condition of severely ill and hospitalised patients can be reported promptly and clinical treatment advice can be provided to clinicians.
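
The temperature and weight α mentioned above are the two hyperparameters of the classic distillation objective. The following is a minimal sketch of that objective in plain Python, not the authors' implementation. It assumes that α weights the hard-label term and (1 - α) the soft-target term; the abstract does not say which convention the paper uses, and the logits are invented for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=7.0, alpha=0.1):
    """Weighted sum of hard-label cross-entropy and soft-target KL.

    Assumption: alpha weights the hard-label term and (1 - alpha) the
    soft term; the abstract does not state the convention.
    """
    # Hard-label term: cross-entropy at temperature 1.
    hard = -math.log(softmax(student_logits)[label])

    # Soft term: KL(teacher || student) at the distillation temperature,
    # scaled by T^2 (a common convention that keeps gradient
    # magnitudes comparable across temperatures).
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student))
    soft *= temperature ** 2

    return alpha * hard + (1.0 - alpha) * soft


# Toy two-class example (normal vs. abnormal); the logits are invented.
loss = distillation_loss([1.2, -0.3], [2.5, -1.0], label=0)
print(f"distillation loss: {loss:.4f}")
```

A high temperature such as 7 flattens both distributions, so the student also learns from the relative probabilities the teacher assigns to the wrong class.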

  • Conference Article
  • Cite Count Icon 9
  • 10.1109/iscslp57327.2022.10038276
Label-free Knowledge Distillation with Contrastive Loss for Light-weight Speaker Recognition
  • Dec 11, 2022
  • Zhiyuan Peng + 4 more

Very deep models for speaker recognition (SR) have demonstrated remarkable performance improvement in recent research. However, it is impractical to deploy these models for on-device applications with constrained computational resources. On the other hand, light-weight models are highly desired in practice despite their sub-optimal performance. This research aims to improve light-weight SR models through large-scale label-free knowledge distillation (KD). Existing KD approaches for SR typically require speaker labels to learn task-specific knowledge, due to the inefficiency of conventional loss for distillation. To address the inefficiency problem and achieve label-free KD, we propose to employ the contrastive loss from self-supervised learning for distillation. Extensive experiments are conducted on a collection of public speech datasets from diverse sources. Results on light-weight SR models show that the proposed approach of label-free KD with contrastive loss consistently outperforms both conventional distillation methods and self-supervised learning methods by a significant margin.
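
To illustrate how a contrastive loss can drive distillation without speaker labels, here is a minimal InfoNCE-style sketch in plain Python. It is an assumption about the general shape of such a loss, not the formulation from the paper: each student embedding is pulled toward the teacher embedding of the same utterance and pushed away from the teacher embeddings of the other utterances in the batch. The embeddings and the temperature `tau` are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_distillation_loss(student_embs, teacher_embs, tau=0.1):
    """InfoNCE-style loss over a batch of paired embeddings.

    The positive for student i is teacher i (same utterance); all other
    teacher embeddings in the batch act as negatives. No labels are used.
    """
    total = 0.0
    for i, s in enumerate(student_embs):
        logits = [cosine(s, t) / tau for t in teacher_embs]
        peak = max(logits)  # stabilise the log-sum-exp
        log_denominator = peak + math.log(
            sum(math.exp(z - peak) for z in logits))
        total += log_denominator - logits[i]  # -log softmax at the positive
    return total / len(student_embs)


# Invented 2-D embeddings for three utterances.
teacher = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled = [aligned[1], aligned[2], aligned[0]]

loss_aligned = contrastive_distillation_loss(aligned, teacher)
loss_shuffled = contrastive_distillation_loss(shuffled, teacher)
print(f"aligned: {loss_aligned:.4f}, shuffled: {loss_shuffled:.4f}")
```

The loss is small when student and teacher embeddings of the same utterance agree and grows when the pairing is broken, which is the signal that replaces speaker labels in this sketch.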

  • Research Article
  • Cite Count Icon 39
  • 10.34133/plantphenomics.0062
Knowledge Distillation Facilitates the Lightweight and Efficient Plant Diseases Detection Model.
  • Jan 1, 2023
  • Plant Phenomics
  • Qianding Huang + 7 more

Timely plant disease diagnosis can inhibit the spread of disease and prevent large-scale drops in production, which benefits food production. Object detection-based plant disease diagnosis methods have attracted widespread attention due to their accuracy in classifying and locating diseases. However, existing methods are still limited to single crop disease diagnosis. More importantly, existing models have a large number of parameters, which makes them hard to deploy on agricultural mobile devices. At the same time, reducing the number of model parameters tends to decrease model accuracy. To solve these problems, we propose a plant disease detection method based on knowledge distillation to achieve a lightweight and efficient diagnosis of multiple diseases across multiple crops. In detail, we design 2 strategies to build 4 different lightweight models as student models: the YOLOR-Light-v1, YOLOR-Light-v2, Mobile-YOLOR-v1, and Mobile-YOLOR-v2 models, and adopt the YOLOR model as the teacher model. We develop a multistage knowledge distillation method to improve lightweight model performance, achieving 60.4% mAP@.5 on the PlantDoc dataset with a small number of model parameters, outperforming existing methods. Overall, the multistage knowledge distillation technique can make the model lighter while maintaining high accuracy. Moreover, the technique can be extended to other tasks, such as image classification and image segmentation, to obtain automated plant disease diagnostic models with a wider range of lightweight applicability in smart agriculture. Our code is available at https://github.com/QDH/MSKD.

  • Research Article
  • Cite Count Icon 8
  • 10.3390/computers13080184
Knowledge Distillation in Image Classification: The Impact of Datasets
  • Jul 24, 2024
  • Computers
  • Ange Gabriel Belinga + 3 more

As the demand for efficient and lightweight models in image classification grows, knowledge distillation has emerged as a promising technique to transfer expertise from complex teacher models to simpler student models. However, the efficacy of knowledge distillation is intricately linked to the choice of datasets used during training. Datasets are pivotal in shaping a model’s learning process, influencing its ability to generalize and discriminate between diverse patterns. While considerable research has independently explored knowledge distillation and image classification, a comprehensive understanding of how different datasets impact knowledge distillation remains a critical gap. This study systematically investigates the impact of diverse datasets on knowledge distillation in image classification. By varying dataset characteristics such as size, domain specificity, and inherent biases, we aim to unravel the nuanced relationship between datasets and the efficacy of knowledge transfer. Our experiments employ a range of datasets to comprehensively explore their impact on the performance gains achieved through knowledge distillation. This study contributes valuable guidance for researchers and practitioners seeking to optimize image classification models through knowledge distillation. By elucidating the intricate interplay between dataset characteristics and knowledge distillation outcomes, our findings empower the community to make informed decisions when selecting datasets, ultimately advancing the field toward more robust and efficient model development.

  • Research Article
  • Cite Count Icon 2
  • 10.3390/e27040379
Lightweight Pre-Trained Korean Language Model Based on Knowledge Distillation and Low-Rank Factorization.
  • Apr 2, 2025
  • Entropy (Basel, Switzerland)
  • Jin-Hwan Kim + 1 more

Natural Language Processing (NLP) stands as a forefront of artificial intelligence research, empowering computational systems to comprehend and process human language as used in everyday contexts. Language models (LMs) underpin this field, striving to capture the intricacies of linguistic structure and semantics by assigning probabilities to sequences of words. The trend towards large language models (LLMs) has shown significant performance improvements with increasing model size. However, the deployment of LLMs on resource-limited devices such as mobile and edge devices remains a challenge. This issue is particularly pronounced in languages other than English, including Korean, where pre-trained models are relatively scarce. Addressing this gap, we introduce a novel lightweight pre-trained Korean language model that leverages knowledge distillation and low-rank factorization techniques. Our approach distills knowledge from a 432 MB (approximately 110 M parameters) teacher model into student models of substantially reduced sizes (e.g., 53 MB ≈ 14 M parameters, 35 MB ≈ 13 M parameters, 30 MB ≈ 11 M parameters, and 18 MB ≈ 4 M parameters). The smaller student models further employ low-rank factorization to minimize the parameter count within the Transformer's feed-forward network (FFN) and embedding layer. We evaluate the efficacy of our lightweight model across six established Korean NLP tasks. Notably, our most compact model, KR-ELECTRA-Small-KD, attains over 97.387% of the teacher model's performance despite an 8.15× reduction in size. Remarkably, on the NSMC sentiment classification benchmark, KR-ELECTRA-Small-KD surpasses the teacher model with an accuracy of 89.720%. These findings underscore the potential of our model as an efficient solution for NLP applications in resource-constrained settings.
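
The parameter saving from low-rank factorization follows from simple counting: a full weight matrix of shape d_in x d_out is replaced by two thinner matrices of shapes d_in x r and r x d_out. The sketch below shows that arithmetic; the layer sizes and the rank are hypothetical values chosen for illustration and are not taken from the paper.

```python
def dense_params(d_in: int, d_out: int) -> int:
    """Weights in a full d_in x d_out matrix (bias ignored)."""
    return d_in * d_out


def low_rank_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights after replacing W with A (d_in x rank) times B (rank x d_out)."""
    return d_in * rank + rank * d_out


# Hypothetical FFN projection sizes, chosen only for illustration.
D_MODEL, D_FFN, RANK = 256, 1024, 64

full = dense_params(D_MODEL, D_FFN)
factored = low_rank_params(D_MODEL, D_FFN, RANK)
print(f"full: {full}, factored: {factored}, ratio: {full / factored:.1f}x")
```

Factorization only reduces the count when the rank is below d_in * d_out / (d_in + d_out), which is 204.8 for the sizes used here.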

  • Research Article
  • Cite Count Icon 15
  • 10.1016/j.eswa.2024.124628
Efficient and lightweight layer-wise in-situ defect detection in laser powder bed fusion via knowledge distillation and structural re-parameterization
  • Jul 4, 2024
  • Expert Systems With Applications
  • Kunpeng Tan + 6 more

  • Research Article
  • Cite Count Icon 3
  • 10.1038/s41598-025-97585-0
Diagnosis of early nitrogen, phosphorus and potassium deficiency categories in rice based on multimodal integration and knowledge distillation
  • Apr 15, 2025
  • Scientific Reports
  • Xuanying Liao + 1 more

Rapid, non-destructive, lightweight and accurate diagnosis of early-stage nutrient deficiency in rice is essential for both yield and quality. Traditional diagnostic methods often exhibit low efficiency, reduced accuracy, and a lack of timeliness. To address these issues, a diagnostic method for the early detection of nitrogen, phosphorus, and potassium deficiencies in rice, based on multimodal integration and knowledge distillation, is proposed. In this study, the late rice variety ‘Huanghuazhan rice’ was selected as the experimental subject for field trials. First, leaf images of rice plants were captured using a scanner, and data preprocessing techniques were used to extract image samples from the leaf tip areas of the top-one, top-two, and top-three leaves. Second, the teacher model was obtained through transfer learning, fine-tuning, and model fusion. The custom neural network model was heuristically customized based on the conventional model. The teacher model then performs knowledge distillation on the custom neural network model, resulting in a lightweight model with high accuracy and low memory consumption, which serves as a feature extractor. Finally, the multimodal features were input into LightGBM for training, and the rice nutrient deficiency recognition model, S-RiceNet-D-LightGBM (SRDL), was constructed. The experimental results demonstrate that the SRDL model is an efficient, lightweight model characterized by high accuracy and low memory consumption. It achieved an accuracy score of 0.9501, a macro precision score of 0.9501, a macro recall score of 0.9499, and a macro F1 score of 0.9500, outperforming VGG16, ResNet101, DenseNet169, InceptionNetV3, and MobileNetV2, and second only to the ensemble model. The memory footprint is 23.6 MB, which is slightly higher than that of the MobileNetV3S model. This study provides new insights and viable avenues for the practical implementation of a lightweight model designed for the intelligent diagnosis of crop nutrient deficiency.

  • Research Article
  • Cite Count Icon 2
  • 10.1016/j.neunet.2025.108205
Remote sensing object detection through hierarchical feature mining and multivariate head collaboration with knowledge distillation.
  • Mar 1, 2026
  • Neural networks : the official journal of the International Neural Network Society
  • Yantong Chen + 3 more

  • Conference Article
  • 10.1145/3652628.3652707
Training Model by Knowledge Distillation for Image-text Matching: Use knowledge distillation method to compress pre-trained models in Image-Text matching tasks. Design lightweight models and use knowledge distillation methods to achieve better results for previously ineffective models after training.
  • Nov 17, 2023
  • Hai Liu + 2 more

In Image-Text matching tasks, advanced algorithms typically rely on deep learning models that contain complex architectures and a large number of parameters. This study proposes a knowledge distillation-based strategy for training efficient and compact models. Specifically, two pretrained models are trained first, and a lightweight model with fewer parameters is then constructed. Subsequently, the knowledge distillation method is used to transfer the similarity and middle-layer features of images and text to the student model. The experimental results indicate that applying knowledge distillation enables the student models to maintain high performance while reducing computational costs. The lightweight model demonstrates satisfactory performance on the Flickr30K dataset, showing the practical feasibility of knowledge distillation techniques in Image-Text matching tasks.

  • Research Article
  • Cite Count Icon 8
  • 10.1016/j.jclepro.2024.143663
A new lightweight framework based on knowledge distillation for reducing the complexity of multi-modal solar irradiance prediction model
  • Sep 16, 2024
  • Journal of Cleaner Production
  • Yunfei Zhang + 5 more

  • Research Article
  • Cite Count Icon 2
  • 10.3390/app14083284
Knowledge Distillation Based on Fitting Ground-Truth Distribution of Images
  • Apr 13, 2024
  • Applied Sciences
  • Jianze Li + 3 more

Knowledge distillation based on the features from the penultimate layer allows the student (lightweight model) to efficiently mimic the internal feature outputs of the teacher (high-capacity model). However, the training data may not conform to the ground-truth distribution of images in terms of classes and features. We propose two knowledge distillation algorithms to solve the above problem from the directions of fitting the ground-truth distribution of classes and fitting the ground-truth distribution of features, respectively. The former uses teacher labels to supervise student classification output instead of dataset labels, while the latter designs feature temperature parameters to correct teachers’ abnormal feature distribution output. We conducted knowledge distillation experiments on the ImageNet-2012 and Cifar-100 datasets using seven sets of homogeneous models and six sets of heterogeneous models. The experimental results show that our proposed algorithms improve the performance of penultimate layer feature knowledge distillation and outperform other existing knowledge distillation methods in terms of classification performance and generalization ability.

  • Conference Article
  • 10.1109/dsit55514.2022.9943931
Lightweight High-Resolution Remote Sensing Scene Classification via Adaptive Enhanced Knowledge Distillation
  • Jul 22, 2022
  • Zhiming Huang + 3 more

Scene classification is a key step in intelligent information processing of high-resolution remote sensing, which aims to identify the land-use types of remote sensing image blocks. In recent years, models based on deep convolutional neural networks (CNNs) have made outstanding achievements in the field of remote sensing, but these models are computationally expensive and time-consuming, while lightweight shallow network models have few parameters and run fast but offer low accuracy. Neither the deep CNN models nor the lightweight shallow network models can be directly applied to embedded devices. Therefore, we propose Adaptive Enhanced Knowledge Distillation (AE-KD) to deeply mine the output and feature information from the teacher model and transfer it to the student model, so as to improve the performance of the lightweight model. First, to handle the uneven degree of difference among remote sensing image categories, an adaptive temperature mechanism is proposed by improving the temperature mechanism of traditional knowledge distillation, which helps the student model better learn the probability distribution knowledge from the output layer of the large and deep teacher model. Then, the spatial attention and inter-channel correlation of features are added as constraints so that the student model learns multi-level knowledge from the teacher model. The experimental results on the UC Merced Land-Use and AID public datasets show that the proposed method reduces the teacher model's parameters by 91% and improves the prediction speed by 22 times with only a small loss of classification accuracy, which makes it effective for the lightweight model. An ablation study further analyzes and discusses the performance improvement of the student model under different levels of knowledge distillation.

  • Conference Article
  • Cite Count Icon 106
  • 10.1109/icassp39728.2021.9415063
Towards Practical Lipreading with Distilled and Efficient Models
  • Jun 6, 2021
  • Pingchuan Ma + 3 more

Lipreading has witnessed a lot of progress due to the resurgence of neural networks. Recent works have placed emphasis on aspects such as improving performance by finding the optimal architecture or improving generalization. However, there is still a significant gap between the current methodologies and the requirements for an effective deployment of lipreading in practical scenarios. In this work, we propose a series of innovations that significantly bridge that gap: first, we raise the state-of-the-art performance by a wide margin on LRW and LRW-1000 to 88.5 % and 46.6 %, respectively using self-distillation. Secondly, we propose a series of architectural changes, including a novel Depthwise Separable Temporal Convolutional Network (DS-TCN) head, that slashes the computational cost to a fraction of the (already quite efficient) original model. Thirdly, we show that knowledge distillation is a very effective tool for recovering performance of the lightweight models. This results in a range of models with different accuracy-efficiency trade-offs. However, our most promising lightweight models are on par with the current state-of-the-art while showing a reduction of 8.2× and 3.9× in terms of computational cost and number of parameters, respectively, which we hope will enable the deployment of lipreading models in practical applications.
