Articles published on Word error rate

1745 Search results
  • Research Article
  • 10.22214/ijraset.2026.80691
End-to-End Handwritten Malayalam to English Translation: A Deep Learning Implementation
  • Apr 30, 2026
  • International Journal for Research in Applied Science and Engineering Technology
  • Mohammed Farhan

Bridging the gap between handwritten regional-language documents and automated English translation remains a difficult problem, particularly for morphologically complex, low-resource scripts like Malayalam. The challenge goes beyond simple recognition: handwritten Malayalam exhibits tightly coupled ligatures, circular stroke patterns, and high inter-writer variability that together defeat most off-the-shelf OCR tools. This paper describes an end-to-end deep learning pipeline we built and deployed to address exactly this problem. The architecture works in four stages: a fine-tuned YOLOv8 model localizes individual handwritten words, a custom ResNetCRNN with Bidirectional LSTMs and CTC decoding performs character-level recognition, a KenLM language model combined with SymSpell post-processing corrects phonetic ambiguities, and Meta’s NLLB-200 transformer handles the final Malayalam-to-English translation. The system is delivered as a containerized web application built on FastAPI and React, supporting real-time inference with asynchronous batch processing. Evaluated on a test set of 19,680 handwritten samples, the OCR component achieved a Character Error Rate (CER) of 1.20% and a Word Error Rate (WER) of 7.30%, with 92.7% of predictions being exact matches. These results suggest the pipeline is practically viable for digitizing and translating unconstrained handwritten Malayalam at scale.
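
The CER and WER figures in this abstract are edit-distance metrics. As a minimal sketch (not the authors' evaluation code), both can be computed from a Levenshtein alignment between a reference transcript and a system hypothesis. The function names and the whitespace tokenization are assumptions made for illustration only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Substitutions, deletions, and insertions each cost 1.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()  # assumption: whitespace tokenization
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the cat sat on mat"))
    print(cer("kitten", "sitting"))
```

Because insertions count as errors, both rates can exceed 1.0 when the hypothesis is much longer than the reference.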

  • Research Article
  • 10.3390/electronics15081718
An AI-Based Security Architecture for Fraud Detection in Cloud Call Centers for Low-Resource Languages: Arabic as a Use Case
  • Apr 18, 2026
  • Electronics
  • Pinar Boluk + 1 more

Cloud-based telephony platforms face growing fraud risks including voice phishing (vishing), subscription abuse, and organizational impersonation, with detection being especially challenging in low-resource languages such as Arabic. We present an Artificial Intelligence (AI)-based security architecture for fraud detection in Arabic cloud call centers, combining onboarding verification, behavioral monitoring, domain-adapted Automatic Speech Recognition (ASR), semantic transcript search, and Large Language Model (LLM)-based entity verification. The domain-adapted Langa ASR model achieves a Word Error Rate (WER) of 41.0% and Character Error Rate (CER) of 18.2%, outperforming all evaluated commercial baselines. LLM-based entity extraction with multi-call consensus achieves 97.3% company-name accuracy (Generative Pre-trained Transformer 4, GPT-4) and 92.0% in the cost-effective deployed configuration (GPT-3.5 with log-probability filtering). Evaluated on production data from a Middle East and North Africa (MENA)-region provider spanning more than 1000 accounts, the pipeline flagged 47 accounts of which 41 were confirmed fraudulent (directly observed precision 87.2%, 95% confidence interval (CI): 74.3–95.2%; estimated recall 51–82% under conservative base-rate assumptions—not directly measured), providing evidence for the viability of a unified, threat-model-driven architecture for low-resource telephony fraud detection.
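
The precision figure follows directly from the counts in the abstract (41 confirmed out of 47 flagged). The sketch below recomputes it and derives an exact binomial (Clopper-Pearson) 95% interval by bisection. The abstract does not say which interval method was used, so the choice of Clopper-Pearson is an assumption, and the helper names are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(f, lo=0.0, hi=1.0, tol=1e-10):
    """Root of an increasing function f on [lo, hi] with f(lo) < 0 < f(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided confidence interval for a binomial proportion k/n."""
    # Lower bound: the p at which P(X >= k) equals alpha/2 (increasing in p).
    lower = 0.0 if k == 0 else _bisect(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha/2 (decreasing in p).
    upper = 1.0 if k == n else _bisect(
        lambda p: alpha / 2 - binom_cdf(k, n, p))
    return lower, upper


if __name__ == "__main__":
    confirmed, flagged = 41, 47  # counts taken from the abstract
    precision = confirmed / flagged
    low, high = clopper_pearson(confirmed, flagged)
    print(f"precision = {precision:.1%}, 95% CI = [{low:.1%}, {high:.1%}]")
```

Recall cannot be recomputed this way, because the abstract gives no count of fraudulent accounts that were missed; it reports recall only as an estimate under base-rate assumptions.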

  • Research Article
  • 10.1038/s41598-026-46572-0
Enhancing Hindi OCR robustness with CRNN-ResNet50: a data augmentation approach on the devanagari dataset.
  • Apr 11, 2026
  • Scientific reports
  • Sujeet Kumar + 3 more

Optical character recognition (OCR) is a technology that turns texts and scanned documents into digital formats. Practical OCR systems face many challenges because of heterogeneity in scripts, font styles, and image quality. The proposed work attempts to improve the robustness of OCR models built for the Hindi language by addressing the deficiencies inherent in the widely used IIIT-HW-Dev dataset. Toward this end, two methods of data augmentation are proposed: 1) synthetic images with half characters and conjuncts for complex words, and 2) various image degradations to simulate real-world conditions. The augmented training set is used to train two deep architectures, namely Convolutional Recurrent Neural Networks (CRNNs) and ResNet-50. Comparisons and in-depth experiments further reveal the merits of the suggested data augmentation techniques: the top performance obtained by the CRNN model exceeds current approaches. On the test set, CRNN achieved a Character Error Rate (CER) of 2.14% and Word Error Rate (WER) of 7.96%, clearly better than the CER of 3.27% and WER of 11.82% achieved by the ResNet-50 model. Qualitatively, robustness was observed on complex words and on degraded images. The augmented dataset and trained models are freely available for download to promote more research in Hindi OCR. Overall, developing robust OCR systems for complex scripts starts from variability-rich, representative training data combined with suitable architectures such as the CRNN.

  • Research Article
  • 10.32877/bt.v8i3.3473
NLP Based Tourism Service Optimization on Multilingual Voice Chatbot
  • Apr 10, 2026
  • bit-Tech
  • Muhammad Jundullah + 4 more

The rapid advancement of artificial intelligence has created significant opportunities to enhance tourism services, a vital sector of Indonesia’s economy, particularly in Raja Ampat as a leading ecotourism destination. A persistent challenge in this region is effective communication for homestay management, where limited human resources and linguistic diversity constrain service quality. This study evaluates a multilingual voice chatbot integrating Natural Language Processing, Large Language Models, and a Retrieval-Augmented Generation architecture, supporting Indonesian, English, and a virtual local language. System performance is quantitatively assessed using Speech-to-Text accuracy measured by Word Error Rate, intent classification metrics, semantic retrieval effectiveness, and end-to-end evaluation. The proposed pipeline includes speech data collection, text normalization, multilingual embedding, vector storage, semantic retrieval, and response generation. Results show that STT quality strongly determines downstream performance. Indonesian achieves the lowest WER (0.14) and the highest intent F1-score (0.89), while the virtual language records the highest WER (0.25) and the lowest intent F1-score (0.65). The semantic retriever attains a Mean Average Precision of 0.55, indicating moderate document ranking quality. The integrated end-to-end system achieves an F1-score of 0.857 with a user satisfaction score of 4.4. Compared with existing tourism chatbots, the proposed system uniquely combines multilingual voice interaction with RAG-based grounding to improve response reliability in low-resource settings. These findings demonstrate practical effectiveness for homestay services and highlight scalability to other multilingual tourism regions in Indonesia and beyond.
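
Mean Average Precision, the retrieval metric quoted in this abstract, averages a per-query ranking score over all queries. The sketch below shows one common formulation, not the study's own evaluation script. It assumes binary relevance labels and that every relevant document appears somewhere in the ranked list; the function names and example data are hypothetical.

```python
def average_precision(relevance):
    """Average precision for one ranked result list.

    `relevance` holds 0/1 flags in rank order (1 = relevant document).
    Precision is sampled at each rank where a relevant document appears.
    """
    hits = 0
    precisions = []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(queries):
    """Mean of the per-query average precision values."""
    if not queries:
        raise ValueError("need at least one query")
    return sum(average_precision(r) for r in queries) / len(queries)


if __name__ == "__main__":
    # Two hypothetical queries with ranked relevance judgments.
    ranked = [[1, 0, 1], [0, 1]]
    print(mean_average_precision(ranked))
```

If some relevant documents are never retrieved, the denominator should instead be the total number of relevant documents for the query, which lowers the score.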

  • Research Article
  • 10.70609/g-tech.v10i2.9342
Privacy-Focused AIoT: Implementing an Offline Voice Assistant for Smart Building Management Using Local LLMs
  • Apr 4, 2026
  • G-Tech: Jurnal Teknologi Terapan
  • Fitri Wibowo + 3 more

Voice assistants are increasingly used for smart building control, yet cloud-based architectures raise privacy risks and become unavailable during internet outages. This study designs and evaluates a fully offline AIoT voice assistant for smart building management using local speech and language models. The system employs an edge audio node (Raspberry Pi Zero 2W with ReSpeaker 2-Mics Pi HAT) and a local GPU server running containerized microservices for speech-to-text (Whisper), intent understanding and action planning (Ollama-hosted LLMs), and text-to-speech (Piper). Building devices and sensors are integrated through Home Assistant, enabling voice-driven control and monitoring without sending audio or interaction logs to external services. Experiments in a laboratory smart-building testbed evaluate speech recognition robustness under varying noise levels, LLM command understanding accuracy and memory footprint, and end-to-end IoT task execution. The speech subsystem achieves a Word Error Rate of 5–20% depending on background noise. Across 33 IoT entities, the assistant reaches a 96.67% execution success rate with an average response time of 5.5 s. Among the evaluated local models, Qwen3 8B achieves the highest intent-to-action accuracy (Acc_I2A=100% on an oracle-text command test set with N=43) with 6.8 GB memory use. The results demonstrate that privacy-preserving and resilient voice interaction for smart building management is feasible using current local LLM stacks.

  • Research Article
  • 10.1016/j.eswa.2025.130670
SAM: A privacy-preserving framework for selective attribute masking in voice recordings
  • Apr 1, 2026
  • Expert Systems with Applications
  • Anil Pudasaini + 4 more

Highlights:
  • Introduce Selective Attribute Masking (SAM) for voice privacy.
  • Use adversarial perturbations to mask age, gender, and accent.
  • Propose a composite loss for selective masking in multi-head models.
  • Achieve up to 74.5.
  • Balance privacy protection with speech recognition utility.

Human voice is a rich source of information that can reveal a range of sensitive personal attributes, such as age, gender, and country of origin. With advances in Artificial Intelligence (AI), especially in speech processing, these personal attributes can now be inferred at scale with high accuracy, raising serious privacy concerns. In fact, the ability to extract demographic or identification information from voice data poses risks related to surveillance, profiling, and misuse of personal data, highlighting the urgent need for privacy-preserving solutions in voice-based AI systems. Therefore, this paper proposes Selective Attribute Masking (SAM), a new model-agnostic framework that uses gradient-based adversarial perturbations to suppress the inference of specific speaker attributes from voice recordings, while preserving the accuracy of non-target attributes and maintaining the utility of Automatic Speech Recognition (ASR). Experimental results on the CommonVoice dataset demonstrate that SAM achieves selective masking success rates of up to 74.5% for age, 59.4% for gender, and 54.6% for accent, substantially outperforming baseline methods. At the same time, voice utility (that is, ASR) remains largely unaffected, with the word error rate increasing by less than 3% absolute under moderate perturbations. These findings demonstrate the effectiveness of our proposed framework (SAM) in balancing privacy and utility in voice-based systems.

  • Research Article
  • 10.1177/2167647x251411174
Advancing Dysarthric Speech-to-Text Recognition with LATTE: A Low-Latency Acoustic Modeling Approach for Real-Time Communication.
  • Apr 1, 2026
  • Big data
  • Qurat Ul Ain + 4 more

Dysarthria, a motor speech disorder characterized by slurred and often unintelligible speech, presents substantial challenges for effective communication. Conventional automatic speech recognition systems frequently underperform on dysarthric speech, particularly in severe cases. To address this gap, we introduce low-latency acoustic transcription and textual encoding (LATTE), an advanced framework designed for real-time dysarthric speech recognition. LATTE integrates preprocessing, acoustic processing, and transcription mapping into a unified pipeline, with its core powered by a hybrid architecture that combines convolutional layers for acoustic feature extraction with bidirectional temporal layers for modeling temporal dependencies. Evaluated on the UA-Speech dataset, LATTE achieves a word error rate of 12.5%, phoneme error rate of 8.3%, and a character error rate of 1%. By enabling accurate, low-latency transcription of impaired speech, LATTE provides a robust foundation for enhancing communication and accessibility in both digital applications and real-time interactive environments.

  • Research Article
  • 10.1177/2167647x261427458
Leveraging Transformer-GNN Integration for Multilingual News Speech-to-Text Similarity Modeling.
  • Apr 1, 2026
  • Big data
  • Jaishree Jain + 5 more

The increasing volume of multilingual news broadcasts highlights the need for advanced systems capable of transforming speech into semantically comparable text across languages. Traditional speech-to-text and textual similarity methods often fall short in handling linguistic diversity, contextual ambiguity, and cross-lingual semantic alignment. To overcome these limitations, we introduce a Transformer-Graph Neural Network (GNN) integrated framework for multilingual news speech-to-text similarity modeling. This article presents an approach that leverages a Transformer encoder to extract deep contextual embeddings from speech inputs, capturing sequential and contextual nuances. These embeddings are then structured into graphs that represent semantic relations among words, phrases, and sentences. A GNN refines these graph-based representations by modeling relational dependencies across languages. Finally, a cross-lingual semantic alignment module produces similarity scores, enabling accurate transformation of multilingual speech into comparable text. Experiments conducted on benchmark multilingual news video datasets in English, Hindi, Marathi, and Tamil show that our framework consistently outperforms baseline models, including standalone Transformers and GNNs. The model achieved significant gains, with improvements of 7.8% in semantic similarity accuracy, 6.1% in BLEU score, and 8.4% in cross-lingual alignment efficiency. Furthermore, it demonstrated robustness to noisy input, code-switching, and low-resource language scenarios, making it suitable for practical multilingual news applications. The proposed approach achieved a relative improvement of 4.8% in semantic similarity and a 3.1% reduction in word error rate compared with the baseline models. Future directions include extending the framework for real-time deployment, expanding support to underrepresented languages, and incorporating multimodal news data for enriched global media analysis.

  • Research Article
  • 10.11591/ijeecs.v42.i1.pp71-80
Integrating blind source separation and self-supervised learning for Algerian Arabic connected-digit recognition
  • Apr 1, 2026
  • Indonesian Journal of Electrical Engineering and Computer Science
  • Mourad Reggab + 1 more

This paper proposes an improvement in Arabic automatic speech recognition (ASR) by combining blind source separation (BSS) with self-supervised acoustic modeling. The study concentrates on the Algerian Arabic connected-digit recognition task and reexamines the classical degenerate unmixing estimation technique (DUET) as a front-end approach for suppressing noise and interference. The output of the BSS stage is fed into a Hidden Markov model (HMM) recognizer developed using the HTK toolkit. To contextualize DUET’s performance, it is compared with modern neural separation techniques (Conv-TasNet, SepFormer) paired with both traditional and self-supervised ASR back ends (Wav2Vec 2.0 and Whisper). A new corpus of 11,230 utterances from 37 speakers, representing dialectal and gender diversity, was collected. Experimental outcomes indicate that DUET enhances word accuracy under stereo mixing conditions; however, neural separation combined with self-supervised ASR results in considerably lower word-error rates and stronger robustness in noisy or overlapping-speech scenarios. The study emphasizes practical trade-offs between computational cost and accuracy for deploying low-resource Arabic ASR systems.
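
This abstract reports both word accuracy and word-error rate, which are two views of the same alignment counts. The sketch below follows the scoring convention documented for HTK's HResults tool, where accuracy also penalizes insertions and therefore equals one minus WER. The counts are hypothetical, and the abstract does not say exactly how its figures were scored.

```python
def score_from_counts(n_ref, substitutions, deletions, insertions):
    """Derive recognition metrics from alignment error counts.

    n_ref is the number of words (here: digits) in the reference.
    Returns (percent_correct, accuracy, wer) as fractions.
    """
    if n_ref <= 0:
        raise ValueError("n_ref must be positive")
    # Correctness ignores insertions; accuracy and WER do not.
    correct = (n_ref - deletions - substitutions) / n_ref
    accuracy = (n_ref - deletions - substitutions - insertions) / n_ref
    wer = (substitutions + deletions + insertions) / n_ref
    return correct, accuracy, wer


if __name__ == "__main__":
    # Hypothetical counts for a connected-digit test set.
    correct, accuracy, wer = score_from_counts(
        n_ref=100, substitutions=5, deletions=3, insertions=2)
    print(f"correct={correct:.2f} accuracy={accuracy:.2f} wer={wer:.2f}")
```

With many insertions, accuracy can become negative while percent correct stays high, which is why the two should not be compared directly across papers.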

  • Research Article
  • 10.38124/ijisrt/26mar883
Bert-Based Speech-to-Text Notes Generator for Educational Content Accessibility
  • Mar 23, 2026
  • International Journal of Innovative Science and Research Technology
  • Isaac, Onoriode, Oshevire + 6 more

Students often find it difficult to take accurate and complete notes during lectures due to fast-paced speech, unfamiliar accents, background noise, and the pressure of multitasking. These challenges are even more pronounced for students with learning difficulties, disabilities, or those who are non-native English speakers. Traditional note-taking methods do not always guarantee clarity or completeness, which affects comprehension and academic performance. With advancements in artificial intelligence (AI), it is now possible to explore automated tools that can transcribe and summarize lectures to support more effective learning. This study addresses the problem of limited access to accurate and real-time lecture notes. Existing speech-to-text systems are often trained on clean, studio-quality datasets and struggle to perform well in real-world classroom environments with noise, diverse accents, and technical terms. Most available solutions are not tailored for Nigerian contexts and fail to meet the academic needs of students. To solve this problem, a solution that integrates advanced AI models was developed to improve transcription accuracy and automatically summarize educational content. The system combines Wav2Vec 2.0 for speech recognition and BERT for extractive summarization. Publicly available datasets such as LJ Speech and CNN/DailyMail were used for training and testing. The audio was preprocessed using noise reduction and segmentation, while the text data underwent tokenization and lemmatization. The models were fine-tuned and integrated into a single application with a graphical interface. The system achieved a Word Error Rate (WER) of 0.2 and a ROUGE-1 score of 0.8, indicating strong performance. The interface allows users to upload or record audio, generate full transcripts, produce summaries, and export the output in readable formats. 
In conclusion, this project demonstrates that combining transformer-based models like Wav2Vec 2.0 and BERT can provide an efficient and accessible solution for lecture note generation. It enhances learning for all students, particularly those with special needs, and supports inclusive education through AI-based tools.

  • Research Article
  • 10.53735/cisse.v13i1.231
EQīLevel
  • Mar 21, 2026
  • Journal of The Colloquium for Information Systems Security Education
  • Verónica Elze + 2 more

Intelligent tutoring systems (ITS) used in cybersecurity education often lack the ability to respond to learners’ emotional states during complex analytical tasks. This paper introduces EQīLevel, an emotionally adaptive AI tutoring architecture that integrates reinforcement learning (RL), sentiment detection, and large language model (LLM) dialogue within a lightweight command-line interface (CLI) + FastAPI framework. Traditional intelligent tutoring systems often rely on rigid rule-based structures and rarely account for learners’ emotional states, which can reduce engagement and persistence. EQīLevel addresses this limitation by analyzing voice-based cues and dynamically adapting lesson difficulty, tone, and pacing through a JSON-based Model Context Protocol (MCP). The MCP encodes emotion, performance, and learning-style variables into structured state representations that guide Generative Pre-trained Transformer (GPT) dialogue generation and reinforcement learning policy updates. Evaluation using simulated learner interactions demonstrated 78% successful adaptation to frustration scenarios, Whisper transcription accuracy with a 5.3% word error rate (WER), emotion detection accuracy of 84% with 81% tone alignment, and improved reinforcement learning convergence, with average rewards rising from 0.41 to 0.63. In cybersecurity education, EQīLevel illustrates how emotionally adaptive tutoring may help learners remain resilient when confronting ambiguous and adversarial scenarios such as phishing detection and threat analysis. By combining emotional awareness with adaptive instructional control, EQīLevel demonstrates a scalable framework for emotionally adaptive tutoring.

  • Research Article
  • Cite count: 2
  • 10.1038/s41593-026-02218-y
Restoring rapid natural bimanual typing with a neuroprosthesis after paralysis.
  • Mar 16, 2026
  • Nature neuroscience
  • Justin J Jude + 15 more

Here, recognizing keyboard typing as a familiar, high information rate communication paradigm, we developed an intracortical brain-computer interface (iBCI) typing neuroprosthesis providing bimanual QWERTY keyboard functionality for people with paralysis. Typing with this iBCI involves only attempted finger movements, which are decoded accurately with as few as 30 calibration sentences. Sentence decoding is improved using a 5-gram language model. This typing neuroprosthesis performed well for two iBCI clinical trial participants with tetraplegia, one with amyotrophic lateral sclerosis and one with spinal cord injury. Typing speed is user-regulated, reaching 110 characters per minute, resulting in 22 words per minute with a word error rate of 1.6%. This resembles able-bodied typing accuracy and provides higher throughput than current state-of-the-art hand motor iBCI decoding. In summary, a typing neuroprosthesis that decodes finger movements provides an intuitive, familiar and easy-to-learn paradigm for individuals with impaired communication due to paralysis.

  • Research Article
  • 10.3390/computers15030188
Speech-to-Sign Gesture Translation for Kazakh: Dataset and Sign Gesture Translation System
  • Mar 15, 2026
  • Computers
  • Akdaulet Mnuarbek + 4 more

This paper presents the first prototype of a speech-to-sign language translation system for Kazakh Sign Language (KRSL). The proposed pipeline integrates the NVIDIA FastConformer model for automatic speech recognition (ASR) in the Kazakh language and addresses the challenges of sign language translation in a low-resource setting. Unlike American or British Sign Languages, KRSL lacks publicly available datasets and established translation systems. The pipeline follows a multi-stage process: speech input is converted into text via ASR, segmented into phrases, matched with corresponding gestures, and visualized as sign language. System performance is evaluated using word error rate (WER) for ASR and accuracy metrics for speech-to-sign translation. This study also introduces the first KRSL dataset, consisting of 1200 manually recreated signs, including 95% static images and 5% dynamic gesture videos. To improve robustness under resource-constrained conditions, a Weighted Hybrid Similarity Score (WHSS)-based gesture matching method is proposed. Experimental results show that the FastConformer model achieves an average WER of 10.55%, with 7.8% for isolated words and 13.3% for full sentences. At the phrase level, the system achieves 92.1% accuracy for unigrams, 84.6% for bigrams, and 78.3% for trigrams. The complete pipeline reaches 85% accuracy for individual words and 70% for sentences, with an average latency of 310 ms. These results demonstrate the feasibility and effectiveness of the proposed system for supporting people with hearing and speech impairments in Kazakhstan.

  • Research Article
  • Cite count: 1
  • 10.7554/elife.109400
High-fidelity neural speech reconstruction through an efficient acoustic-linguistic dual-pathway framework.
  • Mar 5, 2026
  • eLife
  • Jiawei Li + 4 more

Reconstructing speech from neural recordings is crucial for understanding human speech coding and developing brain-computer interfaces (BCIs). However, existing methods trade off acoustic richness (pitch, prosody) for linguistic intelligibility (words, phonemes). To overcome this limitation, we propose a dual-path framework to concurrently decode acoustic and linguistic representations. The acoustic pathway uses a long-short term memory (LSTM) decoder and a high-fidelity generative adversarial network (HiFi-GAN) to reconstruct spectrotemporal features. The linguistic pathway employs a transformer adaptor and text-to-speech (TTS) generator for word tokens. These two pathways merge via voice cloning to combine both acoustic and linguistic validity. Using only 20 min of electrocorticography (ECoG) data per human subject, our approach achieves highly intelligible synthesized speech (mean opinion score = 4.0/5.0, word error rate = 18.9%). Our dual-path framework reconstructs natural and intelligible speech from ECoG, resolving the acoustic-linguistic trade-off.

  • Research Article
  • Cite count: 3
  • 10.1109/tnnls.2025.3615971
IML-Spikeformer: Input-Aware Multilevel Spiking Transformer for Speech Processing.
  • Mar 1, 2026
  • IEEE transactions on neural networks and learning systems
  • Zeyang Song + 4 more

Spiking neural networks (SNNs), inspired by biological neural mechanisms, represent a promising neuromorphic computing paradigm that offers energy-efficient alternatives to traditional artificial neural networks (ANNs). Despite proven effectiveness, SNN architectures have struggled to achieve competitive performance on large-scale speech processing tasks. Two key challenges hinder progress: 1) the high computational overhead during training caused by multitimestep spike firing and 2) the absence of large-scale SNN architectures tailored to speech processing tasks. To overcome these issues, we introduce the input-aware multilevel spikeformer (IML-Spikeformer), a spiking transformer architecture specifically designed for large-scale speech processing. Central to our design is the input-aware multilevel spike (IMLS) mechanism, which simulates multitimestep spike firing within a single timestep using an adaptive, input-aware thresholding scheme. IML-Spikeformer further integrates a reparameterized spiking self-attention (RepSSA) module with a hierarchical decay mask (HDM), forming the HD-RepSSA module. This module enhances the precision of attention maps and enables modeling of multiscale temporal dependencies in speech signals. Experiments demonstrate that IML-Spikeformer achieves word error rates (WERs) of 6.0% on AiShell-1 and 3.4% on Librispeech-960, comparable to conventional ANN transformers while reducing theoretical inference energy consumption by $4.64\times$ and $4.32\times$, respectively. IML-Spikeformer marks an advance of scalable SNN architectures for large-scale speech processing in both task performance and energy efficiency. Our source code and model checkpoints are publicly available at github.com/Pooookeman/IML-Spikeformer.

  • Research Article
  • 10.1145/3793254
Post-ASR Correction for Low-Resource Rajasthani Language
  • Feb 27, 2026
  • ACM Transactions on Asian and Low-Resource Language Information Processing
  • Abhishek Bhandari + 1 more

State-of-the-art multilingual Automatic Speech Recognition (ASR) models produce systematic errors when applied to low-resource languages like Rajasthani, for which they lack dedicated training data. This article addresses this challenge by introducing a post-ASR correction framework that leverages the complementary error patterns in the outputs (termed as views) from two distinct models: Whisper-large-v3 and MMS-1B-All. We propose a multi-view, character-level sequence-to-sequence (Seq2Seq) model that uses a gated fusion mechanism to dynamically weigh information from the two ASR outputs. On a new benchmark created from the IndicTTS Rajasthani corpus, our gated model achieves a Character Error Rate (CER) of 7.86% and a Word Error Rate (WER) of 29.98%. This outperforms the best single-view baselines (8.01% CER and 30.33% WER), simple multi-view concatenation (8.21% CER and 30.05% WER), as well as Llama-3.2-3B and mBART-50-large, both fine-tuned on Whisper and MMS inputs. It also surpasses powerful Large Language Models (LLMs) like GPT-4o and Gemini 2.5 Pro in a zero-shot setting. This work establishes the first baseline for post-ASR correction in Rajasthani, demonstrating that a compact, specialized model is more effective than general-purpose LLMs for this targeted, low-resource task.

  • Research Article
  • 10.15587/1729-4061.2026.350949
Improving speech-to-text for the Indonesian language using a modified transformer
  • Feb 27, 2026
  • Eastern-European Journal of Enterprise Technologies
  • Ratna Atika + 2 more

The object of this study is a transformer-based ASR architecture trained using an Indonesian speech dataset consisting of audio recordings and corresponding transcripts. This study examines the development of an Automatic Speech Recognition (ASR) system for Indonesian, which is still classified as a low-resource language, particularly in terms of dataset availability and model performance. The problem addressed in this study is the limited performance of the standard transformer model in accurately recognizing Indonesian speech. To overcome this limitation, an encoder modification integrating convolutional and vision transformer (ViT) blocks was proposed and compared with the baseline model. The data were preprocessed through 16 kHz mono audio conversion, silence segmentation, pre-emphasis filtering, log-Mel spectrogram extraction, normalization, and subword tokenization using SentencePiece with byte pair encoding (BPE). The dataset was divided into training, validation, and testing sets with a ratio of 80:10:10, comprising 63,952, 7,994, and 7,994 samples, respectively. Model generalization was improved using the SpecAugment data augmentation technique. The experimental results show that the standard model achieves a word error rate (WER) of 0.162 and a character error rate (CER) of 0.121, while the modified model reduces the WER to 0.158 and the CER to 0.118. The significance of this finding lies in the improved feature representation produced by the modified encoder, where the convolutional block captures local acoustic patterns and the ViT block enhances global context modeling on the spectrogram. This complementary mechanism explains the reduction in errors at the word level, which is crucial for a reliable speech-to-text system. Therefore, the proposed model can be applied to real-time two-way communication in service robot applications.
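
The WER and CER values in this abstract are given as fractions rather than percentages, and the improvement can be read either as an absolute or as a relative reduction. The sketch below works through that arithmetic with the reported numbers; the function names are hypothetical.

```python
def absolute_reduction(baseline, improved):
    """Difference in error rate, in the same unit as the inputs."""
    return baseline - improved


def relative_reduction(baseline, improved):
    """Reduction expressed as a fraction of the baseline error rate."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Values reported in the abstract (fractions, not percentages).
    for name, base, new in [("WER", 0.162, 0.158), ("CER", 0.121, 0.118)]:
        abs_pp = 100 * absolute_reduction(base, new)  # percentage points
        rel = 100 * relative_reduction(base, new)     # percent of baseline
        print(f"{name}: -{abs_pp:.1f} points absolute, -{rel:.1f}% relative")
```

On these figures the modified encoder lowers WER by 0.4 percentage points, a relative reduction of roughly 2.5%. The abstract does not say whether this difference was tested for statistical significance.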

  • Research Article
  • 10.1044/2025_jslhr-25-00562
Automatic Speech Recognition for Intelligibility Assessment in Children With Dysarthria.
  • Feb 26, 2026
  • Journal of Speech, Language, and Hearing Research: JSLHR
  • Jiyoung Choi + 4 more

Accurate assessment of speech intelligibility is critical for children with dysarthria secondary to cerebral palsy. Traditional assessment methods, such as human listeners' orthographic transcription and perceptual ratings (e.g., of ease of understanding [EoU]), are time consuming or subjective. Automatic speech recognition (ASR) may provide a more efficient, objective alternative, but its use for assessing intelligibility in this population is unexamined. This study evaluated the potential of ASR for intelligibility assessment in children with dysarthria and identified the most appropriate ASR systems for approximating human listeners' judgments. Five ASR systems transcribed speech samples from 20 children with dysarthria. Additionally, 168 adult listeners provided orthographic transcriptions and EoU ratings. Word recognition rate (WRR) was used as the metric for calculating ASR and human listeners' transcription accuracy. Spearman correlations were used to assess the relationship between ASR WRR and human WRR, as well as between ASR WRR and human EoU ratings. The WRR yielded by four ASR systems (WhisperX-small, WhisperX-medium, WhisperX-large, and Google Cloud) showed strong correlations with human WRR, with WhisperX-medium demonstrating the strongest correlation. These four systems' WRRs also exhibited moderate-to-strong correlations with EoU ratings, with Google Cloud ASR showing the strongest correlation. In contrast, the WRR of Wav2Vec2 demonstrated a weak correlation with both human WRR and EoU ratings. ASR shows promise for use in intelligibility assessment in children with dysarthria. Of the tested ASR systems, WhisperX-medium appears most promising for approximating human transcription accuracy, whereas Google Cloud ASR aligns best with perceptual ratings. Such differences in ASR performance highlight the need for careful system selection in clinical applications. https://doi.org/10.23641/asha.31397457.
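The comparison described above rests on Spearman rank correlation between per-child ASR WRR and human WRR. As a rough illustration, Spearman's coefficient can be computed as the Pearson correlation of the ranks, with tied values sharing their average rank. The sketch below assumes that definition; the function names and the WRR values are hypothetical and are not data from the study.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical per-speaker word recognition rates (not from the study).
    asr_wrr = [0.35, 0.80, 0.55, 0.92, 0.10]
    human_wrr = [0.40, 0.90, 0.70, 0.85, 0.30]
    print(round(spearman(asr_wrr, human_wrr), 3))  # 0.9
```

Working on ranks rather than raw scores makes the measure sensitive only to whether the ASR system orders speakers the same way listeners do, not to how close the absolute WRR values are.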

  • Research Article
  • 10.36948/ijfmr.2026.v08i01.69755
Text-to-Speech Conversion Using Python and Natural Language Processing Techniques
  • Feb 24, 2026
  • International Journal For Multidisciplinary Research
  • Vaishnavi T

This paper presents a comprehensive postgraduate-level study integrating traditional and neural Text-to-Speech (TTS) systems with advanced Natural Language Processing (NLP) techniques. The research combines rule-based, concatenative, statistical, and neural TTS approaches with linguistic preprocessing methods such as phoneme mapping, prosody modeling, contextual embedding, and transformer-based sequence modeling. Experimental evaluation using processing time, Mean Opinion Score (MOS), and Word Error Rate (WER) demonstrates that NLP-enhanced synthesis significantly improves speech intelligibility and contextual accuracy. The study contributes a scalable Python-based framework suitable for multilingual and assistive applications.

  • Research Article
  • 10.1145/3799235
OdiSR-TL: An ASR System in Odia Language Using Transfer Learning and Pre-trained Models
  • Feb 24, 2026
  • ACM Transactions on Asian and Low-Resource Language Information Processing
  • Malay Majhi + 1 more

This paper presents the Automatic Speech Recognition (ASR) system we developed for Odia. Odia is the primary language of the Indian state of Odisha, but it lacks sufficient annotated speech corpora. However, some other languages have larger publicly available speech resources. Therefore, we employed transfer learning for the development. First, we built monolingual pre-trained models using Bengali, Hindi, and English resources. Then, we used the pre-trained models along with the Odia data to develop the ASR model using a Residual Refinement Learning (RRL) network. This transfer learning model performs better than the baseline model. Certain multilingual pre-trained models, such as Whisper-small and Wav2Vec2.0 XLSR-53, have been quite popular in various speech processing tasks. We also employed those models in the Odia ASR task and found that they improve performance. Furthermore, we propose a hybrid transfer learning technique in which two pre-trained models, Whisper-small and Wav2Vec2.0, are combined with the RRL framework. The proposed hybrid transfer learning model outperformed all the previous models. The final model achieved a word error rate (WER) of 1.15 and a character error rate (CER) of 0.14, which is significantly better than the existing Odia ASR systems. The superiority of the proposed model is also tested by implementing several systems and datasets on other Indian languages on a unified platform.

