Sometimes Painful but Promising: Feasibility and Trade-Offs of On-Device Language Model Inference
The rapid rise of Language Models (LMs) has expanded the capabilities of natural language processing, powering applications from text generation to complex decision-making. While state-of-the-art LMs often boast hundreds of billions of parameters and are primarily deployed in data centers, recent trends show a growing focus on compact models—typically under 10 billion parameters—enabled by quantization and other model compression techniques. This shift paves the way for LMs on edge devices, offering potential benefits such as enhanced privacy, reduced latency, and improved data sovereignty. However, the inherent complexity of even these smaller models, combined with the limited computing resources of edge hardware, raises critical questions about the practical trade-offs in executing LM inference outside the cloud. To address these challenges, we present a comprehensive evaluation of generative LM inference on representative CPU-based and GPU-accelerated edge devices. Our study measures key performance indicators—including memory usage, inference speed, and energy consumption—across various device configurations. Additionally, we examine throughput-energy trade-offs, cost considerations, and usability, alongside an assessment of qualitative model performance. While quantization helps mitigate memory overhead, it does not fully eliminate resource bottlenecks, especially for larger models. Our findings quantify the memory and energy constraints that must be considered for practical real-world deployments, offering concrete insights into the trade-offs between model size, inference performance, and efficiency. The exploration of LMs at the edge is still in its early stages. We hope this study provides a foundation for future research, guiding the refinement of models, the enhancement of inference efficiency, and the advancement of edge-centric AI systems.
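The throughput-energy trade-off this study measures can be sketched numerically. The helper names and the sample numbers below are illustrative, not figures from the paper:

```python
# Minimal sketch: reduce a timed generation run to the two key indicators
# discussed above, decode throughput and energy per generated token.

def tokens_per_second(num_tokens: int, elapsed_s: float) -> float:
    """Decode throughput: generated tokens divided by wall-clock time."""
    return num_tokens / elapsed_s

def joules_per_token(avg_power_w: float, elapsed_s: float, num_tokens: int) -> float:
    """Energy per token: average power draw times elapsed time, per token."""
    return avg_power_w * elapsed_s / num_tokens

# Hypothetical run: 128 tokens in 16 s at an average draw of 7.5 W.
tps = tokens_per_second(128, 16.0)        # 8.0 tokens/s
e_tok = joules_per_token(7.5, 16.0, 128)  # 0.9375 J/token
```

Comparing these two numbers across device configurations is exactly the kind of trade-off curve the study reports.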
- Research Article
16
- 10.1609/aaai.v38i17.29860
- Mar 24, 2024
- Proceedings of the AAAI Conference on Artificial Intelligence
Large Language Models (LLMs) stand out for their impressive performance in intricate language modeling tasks. However, their demanding computational and memory needs pose obstacles for broad use on edge devices. Quantization has therefore been introduced to boost LLMs' on-device efficiency. Recent works show that 8-bit or lower weight quantization is feasible with minimal impact on end-to-end task performance, while activations remain unquantized. On the other hand, mainstream commodity edge devices still struggle to execute these sub-8-bit quantized networks effectively. In this paper, we propose Agile-Quant, an Activation-Guided quantization framework for faster inference of popular Large Language Models (LLMs) on the edge. Based on hardware profiling and activation analysis, we first introduce a basic activation quantization strategy to balance the trade-off between task performance and real inference speed. Then we leverage an activation-aware token pruning technique to reduce outliers and their adverse impact on attentivity. Ultimately, we utilize a SIMD-based 4-bit multiplier and our efficient TRIP matrix multiplication to implement the accelerator for LLMs on the edge. We apply our framework to LLMs of different scales, including LLaMA, OPT, and BLOOM, with 4-bit or 8-bit activation quantization and 4-bit weight quantization. Experiments show that Agile-Quant achieves simultaneous quantization of model weights and activations while maintaining task performance comparable to existing weight-only quantization methods. Moreover, in the 8- and 4-bit scenarios, Agile-Quant achieves an on-device speedup of up to 2.55x compared to its FP16 counterparts across multiple edge devices, marking a pioneering advancement in this domain.
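The basic building block behind 4-/8-bit weight and activation schemes like this is per-tensor symmetric quantization. The sketch below illustrates that primitive only; it is not the Agile-Quant algorithm, and the example weights are invented:

```python
# Per-tensor symmetric quantization: map floats to signed integers in
# [-2**(bits-1), 2**(bits-1) - 1] with a single shared scale.

def symmetric_quantize(xs, bits):
    """Quantize a list of floats; returns (int values, dequantization scale)."""
    qmax = 2 ** (bits - 1) - 1                     # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(x) for x in xs) / qmax
    q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in xs]
    return q, scale

def dequantize(q, scale):
    """Lossy reconstruction of the original floats."""
    return [v * scale for v in q]

weights = [0.9, -0.5, 0.1, -0.7]
q4, s = symmetric_quantize(weights, bits=4)   # integers in [-8, 7]
approx = dequantize(q4, s)                    # each error is at most s/2
```

The per-element rounding error is bounded by half the scale, which is why such low-bit schemes can keep end-to-end task loss small.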
- Conference Article
19
- 10.1109/bdcloud.2018.00110
- Dec 1, 2018
The recent advances in deep neural networks (DNNs) make them attractive for embedded systems. However, it can take a long time for DNNs to make an inference on resource-constrained computing devices. Model compression techniques can address the computational cost of deep inference on embedded devices. These techniques are highly attractive, as they do not rely on specialized hardware or on computation offloading, which is often infeasible due to privacy concerns or high latency. However, it remains unclear how model compression techniques perform across a wide range of DNNs. To design efficient embedded deep learning solutions, we need to understand their behaviors. This work develops a quantitative approach to characterize model compression techniques on a representative embedded deep learning architecture, the NVIDIA Jetson TX2. We perform extensive experiments on 11 influential neural network architectures from the image classification and natural language processing domains. We experimentally show how two mainstream compression techniques, data quantization and pruning, perform on these network architectures, and what the implications of compression are for model storage size, inference time, energy consumption and performance metrics. We demonstrate that there are opportunities to achieve fast deep inference on embedded systems, but one must carefully choose the compression settings. Our results provide insights on when and how to apply model compression techniques and guidelines for designing efficient embedded deep learning systems.
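Of the two techniques characterized here, pruning is the simpler to sketch. Below is a minimal magnitude-based variant on a flat weight list; the thresholding rule and example weights are illustrative, not the paper's exact method:

```python
# Magnitude pruning: zero out the smallest-magnitude fraction of weights,
# shrinking the effective model at the cost of some accuracy.

def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude `sparsity` fraction of `weights`."""
    n_prune = int(len(weights) * sparsity)
    # Magnitude at or below which weights are dropped (ties may over-prune).
    threshold = sorted(abs(w) for w in weights)[n_prune - 1] if n_prune else -1.0
    return [0.0 if abs(w) <= threshold else w for w in weights]

w = [0.5, -0.01, 0.3, 0.02, -0.8, 0.001]
pruned = magnitude_prune(w, sparsity=0.5)   # half the weights become zero
```

As the abstract notes, the resulting sparse matrices only pay off on hardware and kernels that can exploit the zeros, which is why the compression settings must be chosen carefully.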
- Research Article
4
- 10.1145/3767742
- Nov 18, 2025
- ACM Transactions on Internet of Things
Deploying Large Language Models (LLMs) on edge devices presents significant challenges due to computational constraints, memory limitations, inference speed, and energy consumption. Model quantization has emerged as a key technique to enable efficient LLM inference by reducing model size and computational overhead. In this study, we conduct a comprehensive analysis of 28 quantized LLMs from the Ollama library, which by default applies Post-Training Quantization (PTQ) and weight-only quantization techniques, deployed on an edge device (a Raspberry Pi 4 with 4 GB RAM). We evaluate energy efficiency, inference performance, and output accuracy across multiple quantization levels and task types. Models are benchmarked on five standardized datasets (CommonsenseQA, BIG-Bench Hard, TruthfulQA, GSM8K, and HumanEval), and we employ a high-resolution, hardware-based energy measurement tool to capture real-world power consumption. Our findings reveal the trade-offs between energy efficiency, inference speed, and accuracy in different quantization settings, highlighting configurations that optimize LLM deployment for resource-constrained environments. By integrating hardware-level energy profiling with LLM benchmarking, this study provides actionable insights for sustainable AI, bridging a critical gap in existing research on energy-aware LLM deployment.
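Hardware-based energy profiling of the kind used here ultimately reduces a stream of power samples to energy figures. The sketch below shows that reduction under the assumption of evenly spaced samples; the sampling interval and numbers are invented, not the paper's measurement tool:

```python
# Reduce evenly spaced power samples (watts) to total energy and
# energy per generated token.

def energy_joules(power_samples_w, interval_s):
    """Rectangle-rule integration of power over time: sum(P) * dt."""
    return sum(power_samples_w) * interval_s

def energy_per_token(power_samples_w, interval_s, tokens_generated):
    """Total measured energy divided by the number of tokens produced."""
    return energy_joules(power_samples_w, interval_s) / tokens_generated

samples = [4.8, 5.2, 5.0, 5.4, 5.1]           # watts, sampled every 0.5 s
e = energy_joules(samples, 0.5)               # ~12.75 J over 2.5 s
per_tok = energy_per_token(samples, 0.5, tokens_generated=10)
```

Higher sampling rates tighten the integration; a hardware meter matters because software power estimates on boards like the Raspberry Pi can miss platform-level draw.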
- Research Article
- 10.32628/cseit241061100
- Nov 18, 2024
- International Journal of Scientific Research in Computer Science, Engineering and Information Technology
This comprehensive article explores the cutting-edge techniques and challenges associated with on-device inference of Large Language Models (LLMs), a transformative approach that brings advanced AI capabilities directly to mobile and edge devices. The article delves into the intricate balance between the computational demands of LLMs and the resource constraints of mobile hardware, presenting a detailed analysis of various strategies to overcome these limitations. Key areas of focus include model compression techniques such as pruning and knowledge distillation, quantization methods, and the development of efficient model architectures. The article also examines the role of specialized hardware accelerators, including Neural Processing Units (NPUs), FPGAs, and ASICs, in enhancing on-device performance. Additionally, it addresses critical aspects of memory management and optimization strategies crucial for efficient LLM deployment. Through a rigorous evaluation of performance metrics, the article offers insights into the trade-offs between model size, inference speed, and accuracy. It further explores diverse applications and use cases, from real-time language translation to privacy-preserving text analysis, highlighting the transformative potential of on-device LLM inference. The article concludes with an examination of ongoing challenges and future research directions, including improving energy efficiency, enhancing model adaptability, and addressing privacy and security concerns. It provides researchers, developers, and industry professionals with a thorough understanding of the current state and future prospects of on-device LLM inference, underlining its significance in shaping the next generation of AI-powered mobile and IoT applications.
- Research Article
11
- 10.3390/electronics11050682
- Feb 23, 2022
- Electronics
Precise monitoring of respiratory rate in premature newborn infants is essential to initiating medical interventions as required. Wired technologies can be invasive and obtrusive to the patients. We propose a deep-learning-enabled wearable monitoring system for premature newborn infants, where respiratory cessation is predicted using signals that are collected wirelessly from a non-invasive wearable Bellypatch put on the infant's body. We propose a five-stage design pipeline involving data collection and labeling, feature scaling, deep learning model selection with hyperparameter tuning, model training and validation, and model testing and deployment. The model used is a 1-D convolutional neural network (1DCNN) architecture with one convolution layer, one pooling layer, and three fully-connected layers, achieving 97.15% classification accuracy. To address the energy limitations of wearable processing, several quantization techniques are explored, and their performance and energy consumption are analyzed for the respiratory classification task. Results demonstrate a reduction in energy footprint and model storage overhead but a considerable degradation of classification accuracy, meaning that quantization and other model compression techniques are not the best solution for the respiratory classification problem on wearable devices. To improve accuracy while reducing the energy consumption, we propose a novel spiking neural network (SNN)-based respiratory classification solution, which can be implemented on event-driven neuromorphic hardware platforms. To this end, we propose an approach to convert the analog operations of our baseline trained 1DCNN to their spiking equivalent. We perform a design-space exploration using the parameters of the converted SNN to generate inference solutions having different accuracy and energy footprints. We select a solution that achieves an accuracy of 93.33% with 18× lower energy compared to the baseline 1DCNN model. Additionally, the proposed SNN solution achieves similar accuracy to the quantized model with 4× lower energy.
- Research Article
35
- 10.1109/tcds.2016.2550591
- Sep 1, 2016
- IEEE Transactions on Cognitive and Developmental Systems
Human infants can discover words directly from unsegmented speech signals without any explicitly labeled data. In this paper, we develop a novel machine learning method called nonparametric Bayesian double articulation analyzer (NPB-DAA) that can directly acquire language and acoustic models from observed continuous speech signals. For this purpose, we propose an integrative generative model that combines a language model and an acoustic model into a single generative model called the "hierarchical Dirichlet process hidden language model" (HDP-HLM). The HDP-HLM is obtained by extending the hierarchical Dirichlet process hidden semi-Markov model (HDP-HSMM) proposed by Johnson et al. An inference procedure for the HDP-HLM is derived using the blocked Gibbs sampler originally proposed for the HDP-HSMM. This procedure enables the simultaneous and direct inference of language and acoustic models from continuous speech signals. Based on the HDP-HLM and its inference procedure, we developed a novel double articulation analyzer. By assuming HDP-HLM as a generative model of observed time series data, and by inferring latent variables of the model, the method can analyze latent double articulation structure, i.e., hierarchically organized latent words and phonemes, of the data in an unsupervised manner. The novel unsupervised double articulation analyzer is called NPB-DAA. The NPB-DAA can automatically estimate double articulation structure embedded in speech signals. We also carried out two evaluation experiments using synthetic data and actual human continuous speech signals representing Japanese vowel sequences. In the word acquisition and phoneme categorization tasks, the NPB-DAA outperformed a conventional double articulation analyzer (DAA) and baseline automatic speech recognition system whose acoustic model was trained in a supervised manner.
- Video Transcripts
- 10.48448/k1sz-yn78
- May 11, 2022
Language models excel at generating coherent text, and model compression techniques such as knowledge distillation have enabled their use in resource-constrained settings. However, these models can be biased in multiple ways, including the unfounded association of male and female genders with gender-neutral professions. Knowledge distillation without any fairness constraints may therefore preserve or even exaggerate the teacher model's biases in the distilled model. To address this, we present a novel approach to mitigate gender disparity in text generation by learning a fair model during knowledge distillation. We propose two modifications to base knowledge distillation, both built on counterfactual role reversal: modifying teacher probabilities and augmenting the training set. We evaluate gender polarity across professions in open-ended text generated from the resulting distilled and finetuned GPT-2 models and demonstrate a substantial reduction in gender disparity with only a minor compromise in utility. Finally, we observe that language models that reduce gender polarity in language generation do not improve embedding fairness or downstream classification fairness.
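The teacher-probability modification can be pictured with a toy next-token distribution. The token pairs and the averaging rule below are illustrative stand-ins for the paper's counterfactual role reversal, not its exact procedure:

```python
# Toy sketch: equalize the teacher's probabilities across counterfactual
# gendered-token pairs before distillation, so the student does not inherit
# the teacher's gender skew. Pairs and numbers are invented.

GENDER_PAIRS = [("he", "she"), ("him", "her")]

def equalize_teacher_probs(probs):
    """Return a copy of a token->probability dict in which each paired
    gendered token receives the mean of the pair's probabilities."""
    out = dict(probs)
    for a, b in GENDER_PAIRS:
        if a in out and b in out:
            mean = (out[a] + out[b]) / 2.0
            out[a] = out[b] = mean
    return out

teacher = {"he": 0.375, "she": 0.125, "doctor": 0.5}
fair = equalize_teacher_probs(teacher)   # "he" and "she" both become 0.25
```

Because the pair's total mass is redistributed rather than removed, the modified distribution still sums to one and can be used directly as the soft target in the distillation loss.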
- Conference Article
6
- 10.1109/infoteh51037.2021.9400658
- Mar 17, 2021
Digitalization and automation have been the driving forces of the fourth industrial revolution, and with them comes a vast amount of data that must be collected and processed. Because of the high costs of operating data centers, the task of processing this data has been moved onto edge devices, which process the data on the spot and can act independently. In this paper, an accelerator for artificial neural network inference on edge devices, the Intel Neural Compute Stick (NCS), is evaluated in terms of speed, energy consumption, and performance/cost ratio, and compared to other similar solutions. We conclude that inference is significantly faster with the Intel NCS than without it. Because of its low cost and low power consumption, this solution is well suited to private projects and prototyping.
- Conference Article
1
- 10.1109/hpcc-smartcity-dss50907.2020.00167
- Dec 1, 2020
Convolutional Neural Network (CNN) optimization is critical to reduce the inference latency on computing devices like CPU and GPU. The most important step in efficiently executing these algorithms involves combining multiple operations within a single convolutional layer through a process called Layer Fusion. We first establish a correlation between Layer Fusion and data management on computing platforms like CPUs and GPUs, and analyze its significance for different networks under different model compression techniques, e.g., Pruning and Quantization. Weight Pruning removes redundant parameters in the network, thereby shrinking the model size. This method, however, creates sparse matrices which can hamper the performance on CPU and GPU devices. Several node/filter pruning algorithms have been developed to resolve the bottlenecks of irregular pruning and reduce the inference time. Although symmetric pruning techniques reduce the forward path computation time, the weights and activations are still executed in floating point precision, which can be further optimized by Quantization. This method reduces the bit-width of individual CNN parameters from high precision (Float32) to lower precision (Int8). Even though several model compression (pruning and quantization) techniques have been developed to improve performance, their integrated study with Layer Fusion has not been performed. At present, not all CPUs and GPUs explicitly support Int8 multiplication. Hence, we analyze the performance of an optimized implementation of Int8 Quantization on such devices, namely Intel's Skylake and Nvidia's Tesla V100. We develop a novel Node Pruning algorithm to remove redundant filters, which can aid in efficient implementation of Layer Fusion/Quantized Networks. We compare the execution time of our combined pruning and quantization implementation with a traditional node pruning algorithm, achieving a mean speedup of 3.5× on the Skylake CPU.
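The Int8 arithmetic pattern discussed here (quantize, integer multiply-accumulate, one rescale) can be sketched for a single dot product. The scales below are fixed by hand for clarity; real pipelines calibrate them, and this is an illustration rather than the paper's implementation:

```python
# Int8 inference arithmetic: quantize activations and weights to int8,
# accumulate in integer, rescale once at the end.

def quantize_int8(xs, scale):
    """Quantize floats to the signed 8-bit range [-128, 127]."""
    return [max(-128, min(127, round(x / scale))) for x in xs]

def int8_dot(x_q, w_q, x_scale, w_scale):
    """Integer multiply-accumulate (int32-style), then one float rescale."""
    acc = sum(a * b for a, b in zip(x_q, w_q))
    return acc * x_scale * w_scale

x = [0.5, -1.0, 0.25]
w = [1.0, 0.5, -0.5]
x_q = quantize_int8(x, scale=0.01)     # [50, -100, 25]
w_q = quantize_int8(w, scale=0.01)     # [100, 50, -50]
y = int8_dot(x_q, w_q, 0.01, 0.01)     # close to the float dot product
```

Keeping the accumulation in integer and deferring the single float multiply is what makes Int8 paths fast on hardware with native 8-bit multiply support.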
- Research Article
1
- 10.1098/rsta.2023.0395
- Jan 16, 2025
- Philosophical transactions. Series A, Mathematical, physical, and engineering sciences
Modern language models such as bidirectional encoder representations from transformers (BERT) have revolutionized natural language processing (NLP) tasks but are computationally intensive, limiting their deployment on edge devices. This paper presents AxLaM, an energy-efficient accelerator design tailored for encoder-based language models, enabling their integration into mobile and edge computing environments. The data-flow-aware hardware accelerator design, inspired by Simba, makes use of approximate fixed-point POSIT-based multipliers and high-bandwidth memory (HBM) to achieve significant improvements in computational efficiency, power consumption, area and latency compared to the hardware-realized scalable accelerator Simba. Compared to Simba, AxLaM achieves a ninefold energy reduction, a 58% area reduction and a 1.2× latency improvement, making it suitable for deployment in edge devices. The energy efficiency of AxLaM is 1.8 TOPS/W, 65% higher than FACT, which requires pre-processing of the language model before implementing it on the hardware. This article is part of the theme issue 'Emerging technologies for future secure computing platforms'.
- Abstract
- 10.1093/bib/bbaf631.035
- Dec 12, 2025
- Briefings in Bioinformatics
Medical imaging workflows integrate radiology images with their corresponding free-text reports. Large language models (LLMs) and large vision–language models (LVLMs) achieve strong results but face deployment barriers in hospitals due to computational demands, privacy risks and infrastructure needs. Small language models (SLMs) and small vision–language models (SVLMs), typically under 10 billion parameters, provide a more efficient and auditable alternative for on-premise, privacy-preserving applications in radiology. Recent advancements, including CheXzero, MedCLIP, XrayGPT, LLaVA-Med, MedFILIP and MedBridge, show that smaller multimodal models support classification, retrieval and report generation. Complementary baselines from lightweight SLMs such as DistilBERT, TinyBERT, BioClinicalBERT and T5-Small highlight opportunities for radiology report understanding. Building on these efforts, we propose a reproducible evaluation framework anchored on IU-CXR (the Indiana University Chest X-ray dataset), with potential extensions to CT, MRI and ophthalmology datasets. Our framework integrates task metrics such as ROUGE, F1-score and AUROC, together with efficiency measures including VRAM usage, latency, and model size, alongside trust dimensions like factuality, bias, and robustness. We also conduct ablation studies on model architecture, tokenizers and parameter-efficient fine-tuning (e.g., qLoRA), while analyzing trade-offs between accuracy, efficiency, and stability. This work establishes reproducible baselines and guidance for deploying radiology AI, while also advancing open-source research (available at https://github.com/dimplek0424/MedVLMBenchPhase1).
Motivation: Medical vision–language models (Med-VLMs) combine image recognition and natural language processing to automate radiology workflows, enabling tasks such as report generation, classification and structured knowledge extraction.
While LVLMs have demonstrated impressive performance, their adoption in hospitals is limited by high computational demands, privacy concerns and reliance on cloud-scale infrastructure [1–2]. This motivates exploration of SLMs and SVLMs, which typically contain fewer than 10 billion parameters. These smaller models can be efficiently deployed in privacy-sensitive and resource-constrained clinical environments. They offer benefits in terms of interpretability, cost-efficiency and compliance. However, systematic benchmarking is necessary to clarify the trade-offs between accuracy, efficiency, and trustworthiness.
Representative models: Recent advancements indicate the potential of smaller multimodal systems in the medical domain. For instance, MedCLIP utilizes contrastive pretraining on unpaired images and reports, with impressive results [5]. XrayGPT integrates MedCLIP encoders with large language model (LLM) decoders for tasks such as summarization and question answering [6]. MedFILIP introduces fine-grained triplet supervision to effectively capture rare findings [8], while MedBridge employs lightweight adapters to adapt frozen encoders, facilitating efficient benchmarking [9]. Additionally, complementary approaches like CheXzero [4] and LLaVA-Med [7] demonstrate the potential for zero-shot classification and multimodal retrieval within radiology datasets. Lightweight SLMs, such as DistilBERT-66M, TinyBERT-14M, BioClinicalBERT-110M and T5-Small-60M [10], provide efficient baselines for report summarization, labeling, and entity extraction. While not multimodal, these models are essential comparators for assessing the added value of vision–language integration.
Methodology: This study proposes a reproducible benchmarking framework anchored on IU-CXR [3], a publicly available chest radiograph dataset with paired reports. This framework can also be extended to include CT, MRI, and ophthalmology datasets to assess generalization.
It evaluates three primary application tracks: (1) zero-shot classification – testing whether models can detect pathologies without task-specific fine-tuning (e.g., CheXpert labels); (2) multimodal retrieval – matching radiology images to their corresponding reports; and (3) report summarization and entity extraction – condensing free-text findings into concise impressions and identifying structured entities, in alignment with RadGraph annotations. The evaluation rubric integrates task-level metrics (AUROC for classification, ROUGE for summarization and F1 score for extraction) with efficiency measures (VRAM footprint, inference latency and model size). Important trust dimensions, such as factual accuracy, calibration error and robustness to minor perturbations, are also taken into account. To explore trade-offs between efficiency and accuracy, we include studies on tokenizer choice – comparing specialized versus generic vocabularies for radiology [11]; parameter-efficient fine-tuning – using adapters like qLoRA [12–13]; and quantization – implementing 8-bit and 4-bit inference to reduce memory usage without significant loss of stability. This design enables standardized comparisons across encoder-only, encoder–decoder, and decoder-only architectures, facilitating fair benchmarking of both SLMs and SVLMs.
Challenges: Current benchmarks tend to overemphasize chest X-rays, limiting evidence of generalization to CT, MRI and ophthalmology datasets. Pre-processing pipelines such as image normalization, label extraction, and metadata harmonization are applied inconsistently, making fair model comparisons difficult. Furthermore, efficiency strategies like quantization and LoRA may introduce a dip in precision if not systematically tuned. Trustworthiness remains underexplored; SVLMs often struggle with rare pathologies, calibration and factual accuracy.
Additionally, few evaluations compare encoder-only, encoder–decoder and decoder-only architectures, leaving open questions about their relative reliability in clinical settings.
Conclusion: This work systematically benchmarks open-source SLMs and SVLMs, providing reproducible baselines that balance accuracy with deployment constraints. By integrating efficiency, performance and trust into a single framework, it offers practical guidance for hospital-ready, privacy-preserving AI systems. Beyond radiology, this approach contributes to standardized evaluation practices for lightweight multimodal models, bridging the gap between algorithmic advancements and clinical deployment.
- Abstract
1
- 10.1093/bib/bbaf631.077
- Dec 12, 2025
- Briefings in Bioinformatics
- Research Article
- 10.3390/electronics14040775
- Feb 17, 2025
- Electronics
Recent research has explored combining large language models (LLMs) with speech recognition for various services, but such applications require a strong network environment for quality service delivery. For on-device services, which do not rely on networks, resource limitations must be considered. This study proposes HYLR-FO, an efficient model that integrates a smaller language model (LM) and a rule-based system (RBS) to enable fast and reliable voice-based order processing in resource-constrained environments, approximating the performance of LLMs. By considering potential error scenarios and leveraging flexible natural language processing (NLP) and inference validation, this approach ensures both efficiency and robustness in order execution. Smaller LMs are used instead of LLMs to reduce resource usage. The LM transforms speech input, received via automatic speech recognition (ASR), into a consistent form that can be processed by the RBS. The RBS then extracts the order and validates the extracted information. The experimental results show that HYLR-FO, trained and tested on 5000 order data samples, achieves up to 86% accuracy, comparable to the 90% accuracy of LLMs. Additionally, HYLR-FO achieves a processing speed of up to 55 orders per second, significantly outperforming LLM-based approaches, which handle only 1.14 orders per second. This results in a 48.25-fold improvement in processing speed in resource-constrained environments. This study demonstrates that HYLR-FO provides faster processing and achieves accuracy similar to LLMs in resource-constrained on-device environments. This finding has theoretical implications for optimizing LM efficiency in constrained settings and practical implications for real-time low-resource AI applications. Specifically, the design of HYLR-FO suggests its potential for efficient deployment in various commercial environments, achieving fast response times and low resource consumption with smaller models.
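The LM-then-RBS pipeline can be pictured with a toy rule-based extractor: once the language model has normalized the ASR transcript into a consistent "quantity item" form, simple rules parse and validate the order. The grammar and menu below are invented for illustration, not HYLR-FO's actual rules:

```python
import re

# Hypothetical rule-based system (RBS) stage: parse a normalized transcript
# and validate each item, mirroring the extract-and-validate step above.

MENU = {"latte", "espresso", "bagel"}

def extract_order(normalized_text):
    """Parse 'QTY ITEM' phrases; reject items not on the menu."""
    order = []
    for qty, item in re.findall(r"(\d+)\s+(\w+)", normalized_text):
        if item not in MENU:
            raise ValueError(f"unknown item: {item}")  # inference validation
        order.append((item, int(qty)))
    return order

order = extract_order("2 latte 1 bagel")   # [('latte', 2), ('bagel', 1)]
```

Because the rules only see normalized text, the expensive open-ended language understanding stays in the small LM while the deterministic, fast part of the pipeline handles extraction, which is the source of the throughput advantage the abstract reports.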
- Research Article
- 10.1109/tpami.2025.3582000
- Jan 1, 2025
- IEEE Transactions on Pattern Analysis and Machine Intelligence
Recent advances in language models have demonstrated their capacity for context understanding and generative representation. Building on these developments, we propose VLMTraj, a novel multimodal trajectory predictor based on a vision-language model that takes full advantage of the prior knowledge of multimodal large language models and their human-like reasoning across diverse modalities. The key idea is to reframe the trajectory prediction task as visual question answering, using historical information as context and instructing the language model to make predictions in a conversational manner. Specifically, all inputs are transformed into a natural-language style: historical trajectories are converted into text prompts, and scene images are described through image captioning. Visual features from the input images are also converted into tokens via a modality encoder and connector, and the transformed data is then formatted for the language model. To guide the language model in understanding and reasoning over high-level knowledge, such as scene context and social relationships between pedestrians, we introduce auxiliary multi-task questions and answers. For training, we first optimize a numerical tokenizer on the prompt data to effectively separate integer and decimal parts, allowing the language model to capture correlations between consecutive numbers. We then train the language model on all the visual question answering prompts. At inference time, we implement both deterministic and stochastic prediction through beam-search-based most-likely prediction and temperature-based multimodal generation. VLMTraj validates that a language-based model can be a powerful pedestrian trajectory predictor, outperforming existing numerical predictors. Extensive experiments show that VLMTraj successfully understands social relationships and accurately extrapolates multimodal futures on public pedestrian trajectory prediction benchmarks.
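Two of the steps above are easy to make concrete: turning a coordinate history into a text prompt, and tokenizing numbers so that integer and decimal parts become separate tokens. A hedged sketch under assumed formats (the prompt wording, precision, and function names are illustrative, not the paper's actual interface):

```python
def trajectory_to_prompt(history):
    """Convert a list of (x, y) positions into a conversational prompt,
    as in VLMTraj's reframing of prediction as question answering."""
    points = "; ".join(f"({x:.2f}, {y:.2f})" for x, y in history)
    return (f"The pedestrian's last {len(history)} positions were {points}. "
            "Where will they move next?")

def split_number_tokens(value: float):
    """Tokenize a coordinate so the integer and decimal parts are
    separate tokens, mimicking the paper's numerical tokenizer
    (here with an assumed fixed two-decimal precision)."""
    integer, _, decimal = f"{value:.2f}".partition(".")
    return [integer, ".", decimal]

prompt = trajectory_to_prompt([(1.0, 2.5), (1.2, 2.8)])
tokens = split_number_tokens(3.14)  # ["3", ".", "14"]
```

Separating the parts this way keeps nearby coordinates (e.g. 2.50 and 2.80) sharing an integer token, which is one plausible reading of how the tokenizer helps the model capture correlations between consecutive numbers.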
- Research Article
- 10.1016/j.neunet.2025.107855
- Nov 1, 2025
- Neural Networks: The Official Journal of the International Neural Network Society
PT-BitNet: Scaling up the 1-Bit large language model with post-training quantization.