Articles published on knowledge-distillation
4488 Search results
- Research Article
- 10.1609/aaai.v40i9.37684
- Mar 14, 2026
- Proceedings of the AAAI Conference on Artificial Intelligence
- Zhenjie Liu + 4 more
Recent advancements in video diffusion models have significantly enhanced audio-driven portrait animation. However, current methods still suffer from flickering, identity drift, and poor audio-visual synchronization. These issues primarily stem from entangled appearance-motion representations and unstable inference strategies. In this paper, we introduce ConsistTalk, a novel intensity-controllable and temporally consistent talking head generation framework with diffusion noise search inference. First, we propose an optical flow-guided temporal module (OFT) that decouples motion features from static appearance by leveraging facial optical flow, thereby reducing visual flicker and improving temporal consistency. Second, we present an Audio-to-Intensity (A2I) model obtained through multimodal teacher-student knowledge distillation. By transforming audio and facial velocity features into a frame-wise intensity sequence, the A2I model enables joint modeling of audio and visual motion, resulting in more natural dynamics. This further enables fine-grained, frame-wise control of motion dynamics while maintaining tight audio-visual synchronization. Third, we introduce a diffusion noise initialization strategy (IC-Init). By enforcing explicit constraints on background coherence and motion continuity during inference-time noise search, we achieve better identity preservation and refine motion dynamics compared to the current autoregressive strategy. Extensive experiments demonstrate that ConsistTalk significantly outperforms prior methods in reducing flicker, preserving identity, and delivering temporally stable, high-fidelity talking head videos.
- Research Article
- 10.1609/aaai.v40i13.38028
- Mar 14, 2026
- Proceedings of the AAAI Conference on Artificial Intelligence
- Riling Wei + 5 more
Cross-modal Knowledge Distillation has demonstrated promising performance on paired modalities with strong semantic connections, referred to as Symmetric Cross-modal Knowledge Distillation (SCKD). However, implementing SCKD becomes exceedingly constrained in real-world scenarios due to the limited availability of paired modalities. To this end, we investigate a general and effective knowledge learning concept under weak semantic consistency, dubbed Asymmetric Cross-modal Knowledge Distillation (ACKD), aiming to bridge modalities with limited semantic overlap. Nevertheless, the shift from strong to weak semantic consistency improves flexibility but exacerbates challenges in knowledge transmission costs, which we rigorously verified based on optimal transport theory. To mitigate the issue, we further propose a framework, namely SemBridge, integrating a Student-Friendly Matching module and a Semantic-aware Knowledge Alignment module. The former leverages self-supervised learning to acquire semantic-based knowledge and provide personalized instruction for each student sample by dynamically selecting the relevant teacher samples. The latter seeks the optimal transport path by employing Lagrangian optimization. To facilitate the research, we curate a benchmark dataset derived from two modalities, namely Multi-Spectral (MS) and asymmetric RGB images, tailored for remote sensing scene classification. Comprehensive experiments exhibit that our framework achieves state-of-the-art performance compared with 7 existing approaches on 6 different model architectures across various datasets.
- Research Article
- 10.1088/1361-6501/ae4cab
- Mar 13, 2026
- Measurement Science and Technology
- Yihong Zhang + 5 more
Training lightweight traffic object detectors with knowledge distillation (KD) is crucial for intelligent transportation systems under resource-limited conditions. However, most KD approaches still rely on fixed thresholding to generate binary masks for student feature reconstruction, disrupting semantic continuity in complex traffic scenes. In this work, a Dual Attention Distillation Framework (DADF) using Adaptive Feature Fusion is proposed for traffic object recognition. Instead of binary masks, DADF produces soft masks as Softmax-normalized distributions along both spatial and channel dimensions, thereby more effectively regulating the continuity of semantic information. To adaptively balance spatial and channel cues, teacher feature variances are utilized for weighting and fusing the masks into a unified attention map. A multilayer perceptron (MLP) generator is then used to reconstruct the masked student features. Finally, the distillation process is optimized by minimizing the mean squared error (MSE) between the reconstructed and teacher features. We extensively validated the effectiveness of the DADF method across multiple datasets and detectors. On Cityscapes, it boosts YOLOv8 mAP from 41.9% to 44.1%, while cutting parameters and GFLOPs by 73.0% and 71.6%, and raising inference speed from 188.7 to 202.8 FPS. On KITTI, DADF boosts the RT-DETR mAP from 85.8% to 90.5%, even surpassing its teacher model. It also cuts parameters by 31.0%, reduces GFLOPs by 32.5%, and increases speed from 33.8 to 35.3 FPS. These results highlight DADF’s suitability for traffic measurement applications under resource constraints.
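The soft-mask idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature layout (a list of channels, each a list of flattened spatial activations), the use of mean absolute activation as the attention statistic, the variance-based fusion rule, and the names `soft_masks` and `mse` are all assumptions made for illustration. The MLP generator is omitted.

```python
import math
from statistics import pvariance


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_masks(teacher, temperature=1.0):
    """teacher: list of C channels, each a list of N spatial activations."""
    num_channels, num_positions = len(teacher), len(teacher[0])
    # Mean absolute activation per position (spatial cue) and per channel (channel cue).
    spatial_stat = [sum(abs(teacher[c][n]) for c in range(num_channels)) / num_channels
                    for n in range(num_positions)]
    channel_stat = [sum(abs(v) for v in teacher[c]) / num_positions
                    for c in range(num_channels)]
    # Soft masks: every position/channel keeps a nonzero weight, unlike a binary threshold.
    spatial_mask = softmax(spatial_stat, temperature)
    channel_mask = softmax(channel_stat, temperature)
    # Assumed fusion rule: weight each cue by the variance of its statistic.
    var_s, var_c = pvariance(spatial_stat), pvariance(channel_stat)
    total = var_s + var_c
    w_s = var_s / total if total > 0 else 0.5
    fused = [[w_s * spatial_mask[n] + (1 - w_s) * channel_mask[c]
              for n in range(num_positions)] for c in range(num_channels)]
    return spatial_mask, channel_mask, fused


def mse(a, b):
    """Mean squared error between two C x N feature maps."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)
```

In a full pipeline the fused mask would modulate the student features before reconstruction, and `mse` would compare the reconstructed features with the teacher's.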
- Research Article
- 10.3390/bioengineering13030339
- Mar 13, 2026
- Bioengineering (Basel, Switzerland)
- Alin Adrian Alecu
Physiological dysregulation arising from chronic stress is a key mechanism linking psychosocial factors to long-term health outcomes, yet early identification typically relies on invasive or resource-intensive measurements. This study evaluates whether high-dimensional psychometric survey data can support scalable, non-invasive screening for latent physiological dysregulation. Using longitudinal data from the Midlife in the United States (MIDUS) Waves 2 and 3, we develop a screening-oriented modeling framework that separates longitudinal risk estimation from deployable screening model construction. Physiological targets are defined across inflammatory, metabolic, and neuroendocrine domains using three canonical allostatic load formulations. A teacher-ranking-pruning-student pipeline combines stable feature ranking, parsimony-driven dimensionality reduction, and knowledge distillation. Predictor dimensionality is reduced by more than an order of magnitude without loss of screening performance. Distilled student models consistently outperform linear, tree-based, and direct neural baselines, achieving area under the receiver operating characteristic curve values up to approximately 0.78 and substantial precision-recall lift over baseline prevalence. Longitudinal information is exploited during model development but not required at inference, enabling deployment using psychometric data alone. These findings demonstrate the feasibility of non-invasive screening for latent physiological dysregulation and provide a generalizable framework for translating longitudinal cohort data into deployable population health tools.
- Research Article
- 10.1109/tpami.2026.3672655
- Mar 12, 2026
- IEEE transactions on pattern analysis and machine intelligence
- Xun Yang + 5 more
One-shot Federated Learning (OFL) has emerged as a promising paradigm, enabling global model training with minimal communication overhead. In OFL, the server model is usually distilled from an ensemble of pre-trained client models, while the ensemble also facilitates synthetic data generation for the knowledge distillation process. Prior works show that the performance of the final model is fundamentally tied to both the quality of the synthetic data and the ensemble. However, existing methods often optimize these two components separately, overlooking their interaction. To address this coupled optimization problem and provide a unified solution to the dual challenges of data and model heterogeneity inherent in OFL, we introduce Co-Boosting++, a novel OFL framework where synthetic data generation and ensemble construction mutually enhance each other in an iterative fashion. First, we fix the ensemble and generate hard samples in an adversarial manner. These samples are crucial for enhancing the robustness of knowledge transfer, as they challenge the model to generalize better, thereby improving quality of the synthetic data and subsequent distillation process. Second, leveraging these hard samples, we enhance the ensemble via a Mixture of Experts (MoE) mechanism. MoE allows dynamic adjustment of ensemble weights based on the generated hard samples, which enables the ensemble to better capture diverse and heterogeneous knowledge from client models. Furthermore, we extend Co-Boosting++ to support the simultaneous generation of multiple heterogeneous target models, enabling efficient adaptation to diverse device constraints. Extensive experiments on benchmark datasets demonstrate that Co-Boosting++ consistently outperforms state-of-the-art methods due to its coupled optimization of data and ensemble quality. 
Additionally, Co-Boosting++ is highly practical in real-world model market scenarios, requiring no local training modifications, additional transmissions, or restrictions on client model architectures. Our code is available at https://github.com/rong-dai/Co-Boosting-PP.
- Research Article
- 10.1177/24056456261426603
- Mar 11, 2026
- Web Intelligence
- Lisha Gao + 6 more
Pretrained transformer models have demonstrated excellent performance on complex tasks. To improve their inference efficiency, recent studies have introduced the multi-exit mechanism, which enables early exiting through multiple intermediate classifiers. However, the deep architectures of pretrained transformers cause severe gradient conflicts during multi-exit fine-tuning, leading to degraded shallow-exit accuracy and reduced early-exit efficiency. To address this issue, we propose Separate Reverse, a multi-exit training strategy specifically designed for pretrained transformer models. The method combines reverse iterative optimization with hierarchical knowledge distillation from deeper to shallower exits; it maintains pretrained parameter integrity, enhances the representation capacity of shallow exits, and coordinates gradient updates across exits to balance the optimization of shallow and deep classifiers. Experiments on multiple GLUE benchmark datasets using BERT demonstrate that our method significantly improves shallow-exit accuracy, maintains main-exit performance, and accelerates inference for simple samples by a large margin.
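For readers unfamiliar with the multi-exit mechanism, the following is a minimal sketch of early exiting at inference time. The abstract does not state the exit criterion, so the entropy threshold used here is an assumption, as is the function name `early_exit_predict`. For simplicity the logits of all exits are precomputed; a real model would stop computing deeper layers once an exit fires.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def early_exit_predict(exit_logits, threshold=0.3):
    """exit_logits: logits of each exit, ordered from shallow to deep.

    Returns (predicted_class, index_of_exit_used). The first exit whose
    prediction entropy falls below the threshold is used; the last exit
    is always accepted as a fallback.
    """
    for index, logits in enumerate(exit_logits):
        probs = softmax(logits)
        is_last = index == len(exit_logits) - 1
        if entropy(probs) < threshold or is_last:
            return probs.index(max(probs)), index
```

Easy samples leave at a shallow exit, which is why shallow-exit accuracy directly affects how much inference can be accelerated.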
- Research Article
- 10.3390/s26061780
- Mar 11, 2026
- Sensors (Basel, Switzerland)
- Abdullah Alshammari
The fast growth of edge-cloud computing infrastructures has increased the cybersecurity burden even as it has substantially amplified the energy use and carbon footprint of intrusion detection systems (IDSs). To address this challenge, this paper proposes GreenShield, a low-carbon cybersecurity framework that combines lightweight cryptography, energy-efficient deep learning, and carbon-conscious system optimization across distributed edge and cloud deployments. GreenShield employs a hierarchical federated learning architecture with integrated knowledge distillation and a carbon-aware scheduling controller that dynamically adjusts security response execution based on threat intensity and renewable energy availability. Extensive experiments on the UNSW-NB15 and CIC-IDS2017 datasets show that GreenShield attains 98.73% detection accuracy and is 67.4% more energy efficient than traditional deep-learning-based IDSs. Further, the proposed system reduces operational carbon emissions by up to 97.6%, equivalent to a reduction of around 2.8 kg CO2-equivalent per hour in a typical edge-deployment scenario, without compromising detection performance. These findings suggest that GreenShield is a viable and scalable option for sustainable cybersecurity that supports carbon-conscious security workflows in future edge-cloud computing architectures.
- Research Article
- 10.3390/computers15030184
- Mar 11, 2026
- Computers
- Mohamed Echchidmi + 1 more
Insect bites are a common cause of skin irritation and can contribute to disease transmission through vector-borne pathogens. Early identification of the likely biting organism can assist preliminary guidance (e.g., monitoring for warning signs, considering exposure history) and may reduce complications through timely follow-up. This paper studies a compact attention-guided learning framework for multiclass insect-bite image classification under strict storage constraints. A teacher network (BiteAI-T) based on MobileNetV3-Small is trained with spatial attention pooling to emphasize lesion-relevant regions while maintaining an efficient backbone. A lightweight depthwise-separable student (BiteAI-S) is trained using multi-level knowledge distillation that combines softened-logit matching with intermediate supervision through attention-map alignment and pooled-feature matching. Model storage is further reduced through weight-only quantization-aware training using an LSQ-inspired learnable scaling factor; BatchNorm running statistics are frozen during quantization fine-tuning to improve stability. Experiments on an eight-class dataset (ants, bed bugs, chiggers, fleas, mosquitos, no bites, spiders, ticks) show that BiteAI-T reaches 93.75% test accuracy. For deployment, we export (i) a TorchScript Lite teacher artifact (BiteAI-TLite, 2.35 MB) and (ii) a weight-only int8 student artifact (BiteAI-Sint8, 0.992 MB). Comparative results are also reported for an SVD-compressed + fine-tuned FP16 variant (92.66% test accuracy, 2.84 MB), illustrating accuracy–size trade-offs across compression strategies.
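The softened-logit matching mentioned in this abstract is commonly implemented as a temperature-scaled KL divergence between teacher and student distributions. The sketch below shows that common formulation only; the temperature value, the T-squared scaling, and the name `softened_kd_loss` are assumptions and may differ from what the paper uses.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature that flattens the distribution."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def softened_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor keeps gradient magnitudes comparable across temperatures,
    following the usual knowledge-distillation convention.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return temperature ** 2 * kl
```

The loss is zero when the student reproduces the teacher's logits and grows as the two softened distributions diverge.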
- Research Article
- 10.1007/s00530-026-02245-6
- Mar 11, 2026
- Multimedia Systems
- Mingfu Zhu + 2 more
Cross-branch knowledge distillation via shallow layer guidance
- Research Article
- 10.13052/jwe1540-9589.2523
- Mar 10, 2026
- Journal of Web Engineering
- Jiang Jiang + 1 more
With the widespread application of recommendation systems in e-commerce, education, and other fields, the heterogeneity of cross-scenario data and the insufficient integration of multi-modal information such as text, images, and user behavior are becoming increasingly prominent. To achieve cross-scenario multi-modal knowledge fusion and knowledge recommendation, a meta doubly robust-debiasing knowledge distillation (MDR-DKD) model is proposed. This model efficiently extracts universal features across scenarios using a small amount of unbiased data through a meta-learning mechanism and optimizes the model by combining knowledge distillation techniques. Finally, combined with the knowledge recommendation module, targeted knowledge recommendation is achieved by calculating the matching degree between user interests and knowledge nodes. The results showed that the multi-modal feature extraction of the model took an average of 18.61 ms, the parameter utilization rate during the feature extraction process was 91.3%, the feature extraction throughput reached 2460 samples/s, and the knowledge recommendation accuracy was 97.84%. This model can effectively extract cross-scenario multi-modal features for accurate knowledge recommendation. The research provides an effective technical path for cross-domain knowledge recommendation, which can promote the implementation of recommendation systems in multi-scenario and multi-modal practical scenarios, and help improve the personalized recommendation experience for users.
- Research Article
- 10.1371/journal.pdig.0001275.r003
- Mar 10, 2026
- PLOS Digital Health
- Saadat Hasan Khan + 3 more
State Department of Health (DOH) websites serve as authoritative sources of HPV-related health communications, presenting state-specific content that influences public awareness and vaccination decisions. We develop a computationally efficient framework to systematically evaluate these information repositories based on their content quality, completeness, and their motivational impact on vaccination behavior. We propose a dataset consolidating 48 different DOH websites’ data targeted towards HPV and HPV vaccination. By developing an annotated dataset (n = 400), efficient prompting techniques and a Knowledge Distillation framework, we develop and evaluate efficient student models based on the Llama family of Large Language Models (LLMs) and the RoBERTa Large encoder architecture. We finally deploy the best-performing student model for a computationally feasible evaluation of the content of DOH websites. We show that the fine-tuned RoBERTa Large model achieves an F1 score of 0.74 on the test set, outperforming all other student models and approaching the teacher model's performance (F1 = 0.77). The fine-tuned RoBERTa Large model is subsequently applied to data from various state DOH websites to evaluate the information presented. We also discuss the broader implications, limitations, and ethical and legal considerations of the proposed approach.
- Research Article
- 10.1093/jamia/ocag032
- Mar 10, 2026
- Journal of the American Medical Informatics Association : JAMIA
- Junmo Kim + 3 more
Traditional electronic health record (EHR) foundation models fail to process unseen medical codes, limiting generalizability across institutions with different vocabularies. To address this problem, we introduce medical concept representation (MedRep), standardized medical concept representations for EHR foundation models, enabling recognition of semantically similar concepts regardless of their specific IDs. We utilized Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) vocabulary covering 7.5 million concepts from 66 medical vocabularies. MedRep integrates large language model-generated concept descriptions and OMOP graph ontology using graph contrastive learning with knowledge distillation. We evaluated MedRep-based models on MIMIC-IV (internal validation) and EHRSHOT (external validation) across 9 prediction tasks including clinical outcomes, phenotypes, and in-hospital events. MedRep consistently outperformed baseline models, particularly in external validation with average improvements of 0.088 in area under the receiver operating characteristic curve and 0.208 in area under the precision-recall curve. Qualitative analysis demonstrated that MedRep-based models identified more clinically relevant concepts when making decisions than the baseline models. Performance improvements remained stable across diverse EHR foundation model architectures, including BEHRT, Med-BERT, and CDM-BERT. MedRep improves the generalizability of EHR foundation models by encouraging similar concepts to have similar representations. EHR foundation models developed at different institutions could cooperate through MedRep, merging knowledge from multiple hospital datasets. In addition, our approach could reduce healthcare disparities by enabling smaller institutions to benefit from models trained on larger datasets. 
MedRep improves EHR foundation model performance, interpretability, and generalizability, serving as a standard baseline representation for EHR foundation models adopting OMOP CDM.
- Research Article
- 10.1038/s41598-026-35627-x
- Mar 10, 2026
- Scientific reports
- Ahmed Elzayat + 8 more
Multimodal brain tumor segmentation models often struggle to generalize across diverse populations due to variations in tumor pathology, patient demographics, and imaging protocols. A common approach to mitigate these challenges involves training separate models per population, employing ensemble methods, fine-tuning pretrained networks, or adopting curriculum learning strategies. While these approaches may yield improvements within specific domains, they often suffer from limited scalability, increased inference cost, poor adaptability to heterogeneous populations, and susceptibility to overfitting or catastrophic forgetting. To address these challenges, we propose a novel Multi-Teacher Single-Student Knowledge Distillation framework (MTSS-KDNet), built on the specialized knowledge of individual teacher models and distilling their collective expertise into a unified student model. Our framework performs population-aware knowledge transfer, guiding the student to integrate the strengths of multiple specialized teachers through both latent- and output-level supervision. This enables effective and independent generalization across all tumor types. In this paper, we focus on five distinct tumor populations: Adult Gliomas, Pediatric Gliomas, Sub-Saharan African Gliomas—which, although pathologically similar to their adult counterparts, often suffer from degraded MRI image quality—Intracranial Meningiomas and Brain Metastases. These tumor types exhibit unique developmental, morphological, anatomical and imaging characteristics, introducing heterogeneity that poses significant challenges to the ability of models to generalize accurately. Our approach achieves superior performance across all five populations, with average dice scores (DSC) of 0.87, 0.84, and 0.77 in the whole tumor (WT), tumor core (TC) and enhancing tumor (ET) regions, respectively, outperforming both population-specific and strong benchmark models. 
These results highlight the robustness and versatility of our method, offering a promising solution for enhancing generalizability in brain tumor segmentation while facilitating seamless clinical deployment.
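The Dice scores (DSC) reported above measure overlap between a predicted and a reference segmentation mask. A minimal sketch of the standard definition, 2|A ∩ B| / (|A| + |B|), follows; the flattened binary-list input and the handling of two empty masks are illustrative assumptions, not details from the paper.

```python
def dice_score(pred, target):
    """Dice similarity coefficient for two equal-length binary masks (flattened)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Voxels marked as tumor in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if size_sum == 0 else 2.0 * intersection / size_sum
```

A score of 1.0 means perfect overlap and 0.0 means no overlap, so the reported 0.87 for the whole tumor region indicates closer agreement than the 0.77 for the enhancing tumor region.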
- Research Article
- 10.3390/rs18050842
- Mar 9, 2026
- Remote Sensing
- Md Rezaul Karim Khan + 1 more
Accurate and reliable vehicle detection, tracking, and counting across different surveillance platforms are fundamental requirements for developing smart Traffic Management Systems (TMS) and promoting sustainable urban mobility. Recent advances in both ground-level surveillance and remote sensing using deep learning have opened new opportunities for extracting detailed vehicular information from high-resolution aerial and surveillance video data. This work presents a unified, real-time vehicle analysis framework that integrates lightweight deep learning–based detection, robust multi-object tracking, and trajectory-driven counting within a single modular pipeline. The proposed framework employs YOLOv10-S, a “You Only Look Once” detector, as the detection backbone and enhances its robustness through supervision-level knowledge distillation without introducing any architectural modifications. Temporal consistency is enforced using an observation-centric multi-object tracking algorithm (OC-SORT), enabling stable identity preservation under camera motion and dense traffic conditions. Vehicle counting is performed using a trajectory-based virtual gate strategy, reducing duplicate counts and improving counting reliability. Comprehensive experiments conducted on the UA-DETRAC and VisDrone benchmarks show that the proposed framework effectively balances detection performance, tracking robustness, counting accuracy, and real-time efficiency in both ground-based and aerial surveillance settings. Furthermore, cross-dataset evaluations under direct train–test transfer highlight the inherent challenges of domain shift while showing that knowledge distillation consistently improves robustness in detection, tracking identity consistency, and vehicle counting. Overall, this framework enables effective real-world traffic monitoring by adopting a scalable and practical system design, where reliability is prioritized over architectural complexity.
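A trajectory-based virtual gate can be sketched as a line-crossing test on tracked centroids. The version below is a simplified illustration, not the paper's implementation: it assumes a horizontal gate at `y = gate_y`, trajectories keyed by track ID, and the hypothetical name `count_gate_crossings`. Counting each track ID at most once is what suppresses duplicate counts.

```python
def count_gate_crossings(trajectories, gate_y):
    """Count tracks that cross the horizontal line y = gate_y.

    trajectories: dict mapping track_id -> list of (x, y) centroids in frame order.
    Each track is counted at most once, the first time it crosses the gate.
    """
    counted = set()
    for track_id, points in trajectories.items():
        for (_, y_prev), (_, y_curr) in zip(points, points[1:]):
            # The track has crossed if consecutive centroids lie on different sides.
            if (y_prev < gate_y) != (y_curr < gate_y):
                counted.add(track_id)
                break
    return len(counted)


# Track 1 jitters back and forth over the gate but is counted once;
# track 2 never reaches the gate; track 3 crosses in the opposite direction.
tracks = {
    1: [(0, 90), (0, 95), (0, 105), (0, 98), (0, 110)],
    2: [(5, 10), (5, 20)],
    3: [(0, 120), (0, 80)],
}
vehicle_count = count_gate_crossings(tracks, gate_y=100)
```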
- Research Article
- 10.1145/3794845
- Mar 9, 2026
- ACM Computing Surveys
- Jiacheng Liu + 7 more
The emergence of large-scale Mixture of Experts (MoE) models represents a significant advancement in artificial intelligence, offering larger model capacity and computational efficiency through conditional computation. However, deploying and running inference on these models presents significant challenges in computational resources, latency, and energy efficiency. This comprehensive survey analyzes optimization techniques for MoE models across the entire system stack. We first establish a taxonomical framework that categorizes optimization approaches into model-level, system-level, and hardware-level optimizations. At the model level, we examine architectural innovations including efficient expert design, attention mechanisms, various compression techniques such as pruning, quantization, and knowledge distillation, as well as algorithm improvement including dynamic routing strategies and expert merging methods. At the system level, we investigate distributed computing approaches, load balancing mechanisms, and efficient scheduling algorithms that enable scalable deployment. Furthermore, we delve into hardware-specific optimizations and co-design strategies that maximize throughput and energy efficiency. This survey provides both a structured overview of existing solutions and identifies key challenges and promising research directions in MoE inference optimization. To facilitate ongoing updates and the sharing of cutting-edge advances in MoE inference optimization research, we have established a repository accessible at https://github.com/MoE-Inf/awesome-moe-inference/ .
- Research Article
- 10.3389/fmicb.2026.1791871
- Mar 9, 2026
- Frontiers in Microbiology
- Sihang Xu + 4 more
Environmental microorganism recognition from microscopic images is crucial for environmental monitoring and ecological analysis. In practical scenarios, microorganism categories often evolve over time, and newly emerging classes usually have only a few labeled samples due to high annotation costs. This combination naturally gives rise to the few-shot class-incremental learning (FSCIL) problem. FSCIL requires models to incrementally learn new classes under severe data scarcity while effectively retaining knowledge of previously learned ones. In this work, we propose a unified FSCIL framework for environmental microorganism recognition. The proposed method is composed of three complementary components. First, a contrastive-inspired fine-grained representation learning strategy is introduced in the base session. This strategy enhances intra-class compactness by mining prediction-consistent augmented samples, without introducing explicit contrastive losses. Second, a prototype rectification mechanism is designed to stabilize the representations of incremental classes by leveraging semantic structures learned from base classes. Third, a dual-graph knowledge distillation framework is proposed to preserve both instance-level and class-level relational knowledge during incremental learning. This process is guided by a teacher model updated via exponential moving average. Experiments conducted on the EMDS-7 dataset demonstrate the effectiveness of the proposed approach. Compared with state-of-the-art FSCIL methods, our method achieves the highest average accuracy of 78.19% and maintains the best final-session accuracy of 65.36%. Meanwhile, strong base-session performance is consistently preserved. These results indicate that the proposed framework effectively mitigates catastrophic forgetting and enables robust adaptation to new microorganism categories in real-world incremental recognition scenarios.
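The teacher updated via exponential moving average (EMA) can be sketched in a few lines. The parameter layout (a flat dict of floats), the decay value, and the name `ema_update` are assumptions for illustration; real models would apply the same rule tensor by tensor.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher parameters: decay * teacher + (1 - decay) * student."""
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}


# The teacher drifts slowly toward the student, which smooths out
# abrupt changes caused by few-shot updates in each incremental session.
teacher_params = {"w": 1.0, "b": 0.0}
student_params = {"w": 0.0, "b": 1.0}
for _ in range(3):
    teacher_params = ema_update(teacher_params, student_params, decay=0.9)
```

A decay close to 1 makes the teacher change slowly, which is the property that helps it retain knowledge of earlier classes.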
- Research Article
- 10.1145/3796229
- Mar 6, 2026
- ACM Transactions on Asian and Low-Resource Language Information Processing
- M'Hamed Amine Hatem + 2 more
Transformer-based models have revolutionized information retrieval, achieving state-of-the-art performance in document retrieval and ranking. For high-resource languages like English, an abundance of high-quality labeled datasets has facilitated the development of powerful models. However, developing powerful models for low-resource languages such as Arabic is challenging due to the scarcity of labeled data. While using translated English datasets can be considered to overcome the lack of labeled data, translated datasets have inherent information loss and inconsistencies introduced during the translation process. As a result, models fine-tuned on translated datasets typically underperform relative to their English counterparts. To address this issue, we explore the potential of transferring expertise from high-resource models to low-resource models. In particular, we investigate whether knowledge learned by English retrieval and reranking models can be effectively transferred to Arabic models via knowledge distillation. Our results demonstrate that knowledge distillation significantly improves the performance of Arabic information retrieval. Our models, fine-tuned using knowledge distillation on the mMARCO Arabic passage-ranking dataset, outperform state-of-the-art retrieval and reranker models. Specifically, our cross-encoder achieves an MRR@10 of 0.254, representing an 8% relative improvement over the previous best cross-encoder, mT5. In terms of recall, our bi-encoder achieves an R@1000 of 0.799, surpassing the late-interaction model mColBERT (R@1000 = 0.749, +6.7%) and the baseline BM25 (R@1000 = 0.637, +25%). Furthermore, by leveraging knowledge distillation with soft labels generated by an ensemble of IR models, we manage to achieve comparable or higher performance without requiring extensive manual annotation. This approach offers an effective mechanism for automatic annotation and pseudo-labeling in low-resource language scenarios.
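The relative improvements quoted above follow from the reported recall values, and MRR@10 is the mean reciprocal rank of the first relevant passage within the top 10 results. The sketch below checks the arithmetic and shows the standard metric definition; the function names and the toy ranking data are illustrative, not taken from the paper.

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


def mrr_at_k(ranked_lists, relevant_sets, k=10):
    """Mean reciprocal rank of the first relevant item within the top k."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)


# Recall figures from the abstract: bi-encoder vs. mColBERT and vs. BM25.
gain_over_mcolbert = round(relative_gain(0.799, 0.749) * 100, 1)  # about 6.7 percent
gain_over_bm25 = round(relative_gain(0.799, 0.637) * 100, 1)      # about 25.4 percent

# Toy example: the first query finds its passage at rank 2, the second misses.
toy_mrr = mrr_at_k([["a", "b"], ["c", "d"]], [{"b"}, {"x"}])
```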
- Research Article
- 10.1007/s10846-026-02379-9
- Mar 6, 2026
- Journal of Intelligent & Robotic Systems
- Hamze Hammami + 5 more
Vision offers richer context than traditional marine sensors (e.g., LiDAR, Doppler Velocity Logger (DVL), sonar) but is harder to interpret on water due to reflections, glare, and dynamic surfaces. SUSHI is a vision-first navigation system for Autonomous Surface Vehicles (ASVs) that fuses detection, water segmentation, and monocular depth to produce camera-centric navigation grids for planning and control. The proposed perception methods achieve 90% segmentation accuracy through knowledge distillation with SAM2 logits, requiring only 500-550 frames and approximately 30 minutes of training. The system implements a YOLO detection model that achieves 94.5% mAP@0.5 (F1 score: 0.91) for trash and obstacle detection in simulation, and benchmarks a monocular depth method that solves the issue of reflective surfaces and can work universally. Path planning uses a Multi-Field Synthesis (MFS) approach: a locally reactive artificial-potential-field component blended adaptively with a global wavefront flow field, mitigating local minima while preserving real-time responsiveness. A behavior layer prioritizes target seeking and mask-based visual exploration when explicit goals are absent. Validation was performed in the TOAST simulator and in a pool environment, demonstrating robust goal targeting and exploration using cameras with minimal side sensing for emergency avoidance.
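The global wavefront component of such a planner can be illustrated with a breadth-first expansion from the goal over an occupancy grid. This sketch covers only the classic wavefront distance field; the adaptive blending with the potential field is not specified in the abstract and is not shown. The grid encoding (0 = free, 1 = obstacle), 4-connectivity, and the name `wavefront` are assumptions.

```python
from collections import deque


def wavefront(grid, goal):
    """Breadth-first distance field from the goal over a 4-connected grid.

    grid: list of rows, 0 = free cell, 1 = obstacle.
    goal: (row, col) of the goal cell.
    Returns a grid of step counts to the goal (None where unreachable).
    """
    rows, cols = len(grid), len(grid[0])
    dist = [[None] * cols for _ in range(rows)]
    dist[goal[0]][goal[1]] = 0
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            in_bounds = 0 <= nr < rows and 0 <= nc < cols
            if in_bounds and grid[nr][nc] == 0 and dist[nr][nc] is None:
                dist[nr][nc] = dist[r][c] + 1
                queue.append((nr, nc))
    return dist


# A wall separates the start (top-left) from the goal (bottom-left).
occupancy = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
field = wavefront(occupancy, goal=(2, 0))
```

Following strictly decreasing distances from any free cell leads around the wall to the goal, which is how a global field of this kind avoids the local minima that a purely reactive potential field can fall into.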
- Research Article
- 10.1186/s13677-026-00872-y
- Mar 6, 2026
- Journal of Cloud Computing
- Bingyang Li + 4 more
Power-grid field operations demand real-time visual monitoring to verify personal protective equipment and tool usage under large depth-of-field. Conventional real-time detectors are efficient but closed-vocabulary; they struggle with rare or unseen objects. Large multimodal models (LMMs) offer open-vocabulary understanding guided by prompts, yet are too heavy for edge deployment. To address these challenges, we propose an LMM-guided distillation framework that transfers prompt-grounded semantics from a large teacher to a lightweight YOLO-style student. The teacher, queried with an expanded prompt set, produces pseudo labels and region–text embeddings. The student is trained with a standard detection objective and three semantic transfers. First, feature distillation aligns student features to teacher region embeddings via a linear projector; second, prompt-aware logit distillation matches student logits to the teacher’s temperature-smoothed prompt distribution; and third, vision–language contrastive alignment ties projected student regions to the correct prompt embedding. Experiments on two benchmark datasets indicate consistent gains on both common and rare categories while retaining real-time throughput on edge hardware, demonstrating a practical cloud-to-edge pipeline for safety monitoring.
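The third transfer, vision–language contrastive alignment, is typically realized as an InfoNCE-style loss: a softmax cross-entropy over cosine similarities between a projected region feature and all prompt embeddings. The sketch below shows that generic form only; the temperature value, plain-list vectors, and the name `contrastive_alignment_loss` are assumptions rather than details from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_alignment_loss(region, prompts, correct_index, temperature=0.07):
    """Cross-entropy over temperature-scaled similarities to every prompt.

    region: projected student region feature.
    prompts: list of prompt embeddings (same dimensionality as region).
    correct_index: position of the prompt that matches the region.
    """
    sims = [cosine(region, p) / temperature for p in prompts]
    peak = max(sims)
    # log-sum-exp computed stably, then subtract the matching prompt's score.
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in sims))
    return log_norm - sims[correct_index]
```

The loss is small when the region is closest to its own prompt and large when another prompt is more similar, which pulls student regions toward the correct text embedding.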
- Research Article
- 10.1038/s41598-026-42981-3
- Mar 5, 2026
- Scientific reports
- Danah Algawiaz
Transformer architectures and large language models remain competitive across a broad range of AI tasks, but they are challenging to deploy in resource-constrained edge computing environments due to high resource demands and the generation of erroneous or fake outputs (hallucinations). In this paper, a single scheme, HALL-OPT, is proposed to address both hallucination detection and latency reduction for real-time edge intelligence. The framework has three main elements: (1) a dual-stream hallucination detector that analyses internal attention behaviour, (2) an adaptive token-pruning system, which decodes and extracts the necessary context at minimal computation, and (3) a lightweight edge-optimized transformer obtained by knowledge distillation. On SQuAD 2.0 and CNN/DailyMail, HALL-OPT detects hallucinations with 94.3% accuracy and achieves a 67.8% reduction in inference latency with only a 2.1% decrease in accuracy compared to the BERT-base model. When deployed on edge hardware, the system provides sub-50 ms response times while consuming 43% less energy. It is appropriate for real-time applications in industrial IoT, autonomous systems, healthcare monitoring, and other settings where low latency is critical. Existing transformer optimisation and hallucination mitigation approaches treat reliability and efficiency as separate objectives, limiting their applicability in real-time edge environments. HALL-OPT uniquely integrates hallucination-aware attention, adaptive pruning, and edge-oriented optimisation into a single unified framework, enabling simultaneous reductions in hallucination, latency, and energy consumption. This integrated design distinguishes HALL-OPT from prior work that optimises accuracy or efficiency in isolation.