Articles published on Speech Synthesis
3527 Search results
- Research Article
- 10.22214/ijraset.2026.80134
- Apr 30, 2026
- International Journal for Research in Applied Science and Engineering Technology
- Ninad Muthe
Conventional reading methods and inaccessible learning materials make it difficult for dyslexic children to improve their reading fluency and comprehension, while existing educational tools often lack personalized, interactive support. To address this challenge, this paper proposes Dyslexic Kid Helper, a user-centric web application designed to assist dyslexic children by providing an interactive, engaging, and supportive literacy experience. The system is developed utilizing Python Flask for a highly compatible web architecture, Tesseract Optical Character Recognition (OCR) to extract text from images and PDFs, and OpenAI GPT to power advanced comprehension features like instant definitions and paragraph simplification. Furthermore, the application is Docker-ready to ensure efficient deployment, scalability, and cloud hosting. By transforming complex text into accessible formats and delivering synchronous speech synthesis, the platform enables independent learning while offering a seamless, dyslexia-friendly user experience featuring tailored fonts and minimal distractions. In summary, Dyslexic Kid Helper acts as a comprehensive assistive learning solution, effectively combining AI and machine learning techniques to address the unique challenges faced by dyslexic children.
- Research Article
- 10.1121/10.0043094
- Apr 1, 2026
- The Journal of the Acoustical Society of America
- Patti Adank + 1 more
Voice cloning technology has developed rapidly and can currently produce high-quality humanlike voices from as little as 10 s of speech. It is unclear whether cloned voices are as intelligible as their human originals. We compared the intelligibility of ten human voices with their ten voice clones in background noise. Eighty participants listened to 80 sentences (40 human, 40 cloned), presented in four signal-to-noise ratios (+3, 0, -3, and -6 dB) in an online experiment. Cloned voices were up to 13.4% more intelligible than their human counterparts across all noise levels. Principal component analysis with linear discriminant analysis classified human and cloned voices correctly in 79.4% of cases based on an extensive set of acoustic measurements, confirming systematic acoustic differences between the two voice types. Human listeners identified human voices with 70.4% accuracy. Elastic net regression analyses indicated that intelligibility in cloned voices was driven mainly by pitch and harmonic measures, whereas formant- and vowel-space measures were more influential for human voices. Our findings have implications for applications of voice cloning, including voice restoration, speech synthesis for non-verbal individuals, and accessibility for people with hearing loss.
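The noise conditions in this abstract are defined by signal-to-noise ratio (+3, 0, -3, and -6 dB). As a minimal sketch, and not the authors' procedure, the following shows one common way to scale a noise signal so that a speech-plus-noise mixture reaches a target SNR based on RMS levels. The function names and the use of plain Python lists of samples are assumptions made for illustration.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def noise_gain_for_snr(speech, noise, target_snr_db):
    """Gain for `noise` so that 20*log10(rms(speech) / rms(gain * noise)) equals the target."""
    return rms(speech) / (rms(noise) * 10 ** (target_snr_db / 20))

def mix_at_snr(speech, noise, target_snr_db):
    """Add scaled noise to speech, sample by sample (hypothetical helper)."""
    gain = noise_gain_for_snr(speech, noise, target_snr_db)
    return [s + gain * n for s, n in zip(speech, noise)]

def measured_snr_db(speech, noise):
    """SNR in dB between a speech signal and an (already scaled) noise signal."""
    return 20 * math.log10(rms(speech) / rms(noise))
```

A lower target SNR yields a larger noise gain, so the -6 dB condition is the hardest listening condition of the four.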
- Research Article
- 10.1142/s1469026826500045
- Mar 31, 2026
- International Journal of Computational Intelligence and Applications
- S S Gundal + 2 more
Speaker gender recognition (SGR) identifies a speaker’s gender from voice characteristics and is used in speech synthesis, voice assistants and human–computer interaction. Traditional methods only rely on features like pitch, whereas recent approaches use deep learning for better accuracy. However, some challenges remain, such as robustness in noisy environments, handling ambiguous voices and achieving high accuracy across languages. Model bias and ethical concerns pose obstacles to real-world deployment. To address these drawbacks, this paper proposes a Speaker Gender Recognition using Optimized Multi-Component Attention Graph Convolutional Neural Network with EfficientNetB7 (SGR-MAGCNN-EffNetB7) technique. Here, data collected through the Mozilla Common Voice dataset are used. The collected data are fed into the feature extraction stage with the help of Multi-Component Attention Graph Convolutional Neural Network (MAGCNN). The extracted features are given to the EfficientNetB7 for identifying the speaker gender as male and female. EfficientNetB7 is integrated by replacing the convolutional layer of MAGCNN, while retaining its dense layers for classification. Finally, the shrike optimization algorithm (SHOA) is proposed for optimizing the weight parameters of MAGCNN-EffNetB7. The simulation outcomes demonstrate that the proposed SGR-MAGCNN-EffNetB7 approach achieves better accuracy and better precision when compared to the existing methods.
- Research Article
- 10.55041/ijsrem58584
- Mar 30, 2026
- INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
- Eswar Raja + 3 more
The recruitment process is high-stakes, yet many candidates struggle due to anxiety and limited, unstructured feedback. Existing mock interview tools do not scale well and rarely capture behavioral signals beyond text. In this work, we present Interview Ai, a multi-modal system for interview simulation and performance analysis. The interaction is delivered through a conversational avatar that speaks and responds in real time, creating a more realistic setting than text-only systems. We model the interview flow as a finite state machine, which allows controlled transitions and adaptive follow-up questions. Candidate responses are evaluated using retrieval-based scoring, where answers are compared against exemplar knowledge rather than relying purely on model generation. The evaluation also incorporates measurable signals, such as filler word frequency, pauses, and response structure. In a study with 50 participants, users showed improvements in clarity, confidence indicators, and reduced hesitation compared to traditional practice. These results suggest that combining structured control, grounded evaluation, and multi-modal interaction can provide effective and scalable interview coaching without replacing human decision-making. Index Terms: Large Language Models, LangGraph, Automated Interview Systems, Computer Vision, Natural Language Processing, Speech Synthesis, GPT-4o-mini
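The abstract above describes the interview flow as a finite state machine and mentions filler word frequency as a measured signal. The sketch below illustrates how those two ideas could fit together; it is not the authors' implementation. The state names, the filler list, the 0.15 threshold, and the rule that a high filler rate triggers a follow-up question are all assumptions made for illustration.

```python
from enum import Enum, auto

class State(Enum):
    GREETING = auto()
    QUESTION = auto()
    FOLLOW_UP = auto()
    FEEDBACK = auto()
    DONE = auto()

# Hypothetical filler list; a real system would likely use a richer lexicon.
FILLERS = {"um", "uh", "like", "basically"}

def filler_rate(answer):
    """Fraction of words in the answer that are filler words."""
    words = [w.strip(".,!?").lower() for w in answer.split()]
    if not words:
        return 0.0
    return sum(w in FILLERS for w in words) / len(words)

def next_state(state, answer="", questions_left=0, threshold=0.15):
    """Controlled transitions: every state has exactly one defined successor rule."""
    if state is State.GREETING:
        return State.QUESTION
    if state is State.QUESTION and filler_rate(answer) > threshold:
        return State.FOLLOW_UP  # assumed rule: hesitant answers get a follow-up
    if state in (State.QUESTION, State.FOLLOW_UP):
        return State.QUESTION if questions_left > 0 else State.FEEDBACK
    return State.DONE  # FEEDBACK and DONE both end in DONE
```

Keeping the transition logic in one function makes the flow easy to audit, which is the usual reason to prefer an explicit state machine over free-form generation for this kind of control.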
- Research Article
- 10.53866/jimi.v6i1.1247
- Mar 26, 2026
- Citizen : Jurnal Ilmiah Multidisiplin Indonesia
- Tigus Juni Betri
Access to communication is a fundamental right of every individual, including the speech-impaired community, which often faces limitations in social interaction. This study designs a real-time system that transforms sign language alphabet gestures into speech by utilizing Computer Vision and Deep Learning technologies. A sign language alphabet dataset is processed using a Convolutional Neural Network (CNN) to recognize visual hand patterns representing letters A–Z. The trained model is then integrated with OpenCV and Mediapipe for real-time hand gesture detection and connected to a speech synthesis engine so that the recognized letters can be automatically spoken. The results demonstrate the system’s potential as a basic communication bridge that supports digital inclusion for the speech-impaired community. From a sustainable development perspective, this innovation is relevant to SDG 4 (Quality Education) and SDG 10 (Reduced Inequalities), as it enables more equitable, inclusive, and sustainable social interaction in the era of digital transformation. This innovation can also serve as a foundation for further development toward an automatic sign language translator capable of recognizing full words and complete sentences. Consequently, this system has strong potential to become a practical solution for ensuring equal access to communication across various sectors, including education, public services, and the workplace.
- Research Article
- 10.3390/electronics15071354
- Mar 25, 2026
- Electronics
- Dongfeng Ye + 3 more
To address the limited expressiveness in current speech synthesis caused by coarse-grained prosody modeling and simplistic feature fusion strategies, a joint prosody modeling framework and a nonlinear fusion method named KAFusion are proposed, based on the Kolmogorov–Arnold (KA) representation theorem. The joint modeling integrates pitch and energy as prosodic priors with text encodings to jointly guide duration prediction, enabling explicit control over speech rate and tone. During feature fusion, KAFusion facilitates nonlinear interactions among features through its nested inner and outer functions. Information entropy serves as the quantitative metric, and both theoretical and experimental results demonstrate the fusion module’s efficacy in suppressing redundancy while preserving task-critical content. Evaluations on the AISHELL3 dataset show a 5.8% improvement in MOS over the baseline. Ablation studies further validate the effectiveness of the proposed components, where KAFusion achieves an output entropy of 3.47, which is 18.4% higher than that of linear fusion (2.93) and indicates richer information content.
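The abstract above uses information entropy as its quantitative metric and reports 3.47 versus 2.93. As a minimal sketch, assuming Shannon entropy in bits over discretized values (the paper's exact estimator is not stated in the abstract), the following computes entropy from a list of symbols and checks the reported relative difference. The function names are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def relative_increase(new, old):
    """Relative change of `new` over `old`, as a fraction."""
    return (new - old) / old

# Figures quoted in the abstract: KAFusion 3.47 vs. linear fusion 2.93.
print(round(relative_increase(3.47, 2.93) * 100, 1))  # 18.4
```

Four equally likely symbols give 2 bits, and a constant signal gives 0 bits, so a higher value indicates a more evenly spread distribution of outputs.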
- Research Article
- 10.3390/jimaging12030119
- Mar 10, 2026
- Journal of imaging
- Hira Nisar + 3 more
Talking Head Generation (THG) is a rapidly advancing field at the intersection of computer vision, deep learning, and speech synthesis, enabling the creation of animated human-like heads that can produce speech and express emotions with high visual realism. The core objective of THG systems is to synthesize coherent and natural audio-visual outputs by modeling the intricate relationship between speech signals, facial dynamics, and emotional cues. These systems find widespread applications in virtual assistants, interactive avatars, video dubbing for multilingual content, educational technologies, and immersive virtual and augmented reality environments. Moreover, the development of THG has significant implications for accessibility technologies, cultural preservation, and remote healthcare interfaces. This survey paper presents a comprehensive and systematic overview of the technological landscape of Talking Head Generation. We begin by outlining the foundational methodologies that underpin the synthesis process, including generative adversarial networks (GANs), motion-aware recurrent architectures, and attention-based models. A taxonomy is introduced to organize the diverse approaches based on the nature of input modalities and generation goals. We further examine the contributions of various domains such as computer vision, speech processing, and human-robot interaction, each of which plays a critical role in advancing the capabilities of THG systems. The paper also provides a detailed review of datasets used for training and evaluating THG models, highlighting their coverage, structure, and relevance. In parallel, we analyze widely adopted evaluation metrics, categorized by their focus on image quality, motion accuracy, synchronization, and semantic fidelity. Operating parameters such as latency, frame rate, resolution, and real-time capability are also discussed to assess deployment feasibility. 
Special emphasis is placed on the integration of generative artificial intelligence (GenAI), which has significantly enhanced the adaptability and realism of talking head systems through more powerful and generalizable learning frameworks.
- Research Article
- 10.1016/j.asoc.2025.114466
- Mar 1, 2026
- Applied Soft Computing
- Yang Liu + 5 more
sEMG-based real-time speech synthesis with feature distillation and dynamic chunk convolution
- Research Article
- 10.1121/10.0042974
- Mar 1, 2026
- JASA express letters
- Eray Eren + 3 more
Recent zero-shot style-transfer speech synthesis methods have shown promising results and addressed adaptation to unseen speaking styles. While most state-of-the-art methods generalize to new speakers and styles using large models or corpora, achieving similar generalization with a smaller model remains an open challenge. We propose a zero-shot method that uses the small GenerSpeech backbone plus a fine-grained style encoder. To disentangle speakers, global/fine-grained styles, and content embeddings, we introduce a mutual-information minimization loss. To further disentangle style from speaker and boost style embedding diversity, we introduce a maximum-mean-discrepancy-guided cycle consistency loss. Experimental results show the proposed method outperforms baseline zero-shot style-transfer methods (GenerSpeech, YourTTS, VALL-E-X) with a relative average style preference improvement of 31% and a 3.64 prosody similarity mean opinion score on VCTK.
- Research Article
- 10.22214/ijraset.2026.77755
- Feb 28, 2026
- International Journal for Research in Applied Science and Engineering Technology
- Rahul Jadhav
This study introduces a real-time speech-to-speech translation framework designed for offline environments, incorporating emotion-aware artificial intelligence and voice-driven interaction to enhance natural multilingual communication. Recent advancements in artificial intelligence have enabled significant improvements in speech-based human–computer interaction systems. However, most commercially available speech translators rely on cloud-based services, resulting in high latency, privacy concerns, and limited usability in low-connectivity environments. The proposed system combines Automatic Speech Recognition (ASR), Neural Machine Translation (NMT), emotion classification, and Text-to-Speech (TTS) synthesis into a unified modular architecture capable of operating without continuous internet access. Speech input is processed locally using lightweight acoustic models, enabling efficient real-time transcription. Emotional characteristics are extracted using prosodic and spectral speech features such as pitch variation, energy distribution, and Mel-frequency cepstral coefficients (MFCCs), allowing the system to interpret contextual sentiment during communication. A transformer-based neural translation framework performs multilingual conversion while maintaining semantic consistency. Emotion-aware speech synthesis further enhances communication by adapting output tone and expressiveness. Additionally, an offline voice-command interface enables hands-free interaction, improving accessibility for visually impaired users and assistive communication scenarios. Experimental evaluation across English, Hindi, and Marathi datasets demonstrates improved recognition accuracy, reduced response latency, and stable offline performance compared with traditional cloud-dependent systems. The proposed framework provides a scalable, privacy-preserving, and resource-efficient solution suitable for educational tools, assistive technologies, and multilingual communication platforms operating in constrained environments.
- Research Article
- 10.55041/ijsrem56917
- Feb 26, 2026
- International Journal of Scientific Research in Engineering and Management
- Ashish Raj + 4 more
The main purpose of this research is to enhance the communication of the disabled community. The authors of this chapter propose an enhanced form of interpersonal interaction for people with special needs, especially those with physical and communication disabilities. Existing communication assistive technology requires costly hardware; hence the need for an affordable alternative that lets paralyzed people communicate. This study therefore introduces an affordable, real-time assistive communication technology for paralyzed people. It uses Dlib to detect facial landmarks, machine learning algorithms to classify facial expressions, and text and speech synthesis to produce the communicated output. Various facial expressions are linked to predefined communication sentences to enable meaningful communication for paralyzed people. Keywords: Affordable Assistive Systems, Communication Accessibility, Computer Vision in Healthcare, Intelligent Assistive Devices, Accessible Technology, Physiological Signal Interpretation, Inclusive Healthcare Innovation.
- Research Article
- 10.64898/2026.02.10.705088
- Feb 11, 2026
- bioRxiv : the preprint server for biology
- Amirhossein Khalilian-Gourtani + 9 more
Speech is a defining human behavior, and this ability depends critically on speech motor cortex. While the ventral precentral and postcentral gyri are classically regarded as chiefly articulatory and somatosensory regions, a growing body of literature challenges this simplification. Most prior research, however, has examined cued or structured speech production tasks, neglecting the automatic, overlearned speech commonly utilized in clinical assessment. Consequently, the neural dynamics and precise timing of cortical recruitment during automatic speech remain poorly understood. Here, we present intracranial electrocorticography (ECoG) recordings from the left perisylvian cortex in participants performing automatic speech such as counting and recitation of overlearned sequences. We investigate neural dynamics using encoding (multivariate temporal response function) and decoding (deep neural network speech synthesis) models. We show that automatic speech engages a distributed network across superior temporal, precentral, and post-central cortices, characterized by attenuated pre-articulatory activity and weaker frontal encoding. Furthermore, two complementary decoding strategies reveal that speech motor cortex represents a mixture of feedforward and feedback signals, with a subset of sites exhibiting exclusively feed-forward dynamics. These results delineate the spatiotemporal cortical organization of automatic speech and establish that the speech motor cortex supports more complex dynamics than purely feedforward control.
- Research Article
- 10.46914/2959-3999-2025-1-4-40-48
- Feb 7, 2026
- Eurasian Journal of Current Research in Psychology and Pedagogy
- A A Mukhametkali + 1 more
The article examines the pedagogical features of integrating artificial intelligence (AI) technologies into the process of teaching Kazakh at higher education institutions. Artificial intelligence is viewed as a tool for enhancing the effectiveness of language education through personalization, data-driven feedback, and increased learner engagement in the digital environment. The authors analyse the didactic potential of AI-based platforms for supporting differentiated instruction, formative assessment, and interactive communication between teacher and student. Special attention is given to national digital resources such as KazakhTTS, KazNERD, Speech Lab and KSC2, which enable speech synthesis, automatic recognition of speech and text, as well as named entity recognition in Kazakh. Their possible applications for developing listening, speaking, reading and writing skills are discussed. At the same time, the article highlights key challenges such as the limited volume of Kazakh-language content, uneven levels of digital competence among teachers, infrastructural constraints, and risks related to academic integrity and data security. Based on the research results, the authors conclude that effective use of artificial intelligence in teaching Kazakh requires aligning technological tools with pedagogical design, teacher training, and national language policy, as well as providing state support for Kazakh-language digital ecosystems and start-up projects.
- Research Article
- 10.3171/2025.11.focus25908
- Feb 1, 2026
- Neurosurgical focus
- Kurt R Lehner + 14 more
The aim of this study was to evaluate the feasibility of using the Layer 7 Cortical Interface, a high-density micro-electrocorticography (µECoG) array, for intraoperative neural recordings and real-time brain-computer interface (BCI) applications, including speech decoding and cursor control. Four patients (age range 23-43 years) who underwent awake craniotomy for tumor resection near the eloquent cortex were enrolled. The Layer 7 µECoG device (1024 channels, approximately 1.5-cm² coverage) was placed on the motor cortex following standard cortical mapping. Intraoperative tasks included a joystick-controlled center-out movement paradigm (n = 3) and an auditory-cued speech repetition task (n = 1). Neural data were recorded at 20 kHz, preprocessed, and used to train decoders intraoperatively. A transformer-based model was applied for real-time speech synthesis and a convolutional neural network was trained for speech classification, while a convolutional recurrent neural network was trained to classify 2D cursor direction. All 4 patients tolerated the procedure without device-related adverse events. The mean electrode impedances across 6 arrays (6144 channels) ranged from 1.21 to 1.99 MΩ, with 954-990 channels per array retained for analysis. In the speech task, a 4-word classification model achieved 77.5% accuracy, and a real-time synthesis model was able to distinguish speech and silence during approximately 20 minutes of data recording in the operating room. In the motor task, a 4-direction classification model achieved 78%-84% accuracy. Recordings remained stable during tumor resection. The Layer 7 Cortical Interface device enabled high-resolution nonpenetrating cortical recordings that supported real-time speech classification and cursor control within the limited timeframe of an intraoperative session. 
These findings highlight the potential clinical applications of high-density µECoG for functional mapping, diagnostic assessment, and future chronic BCI systems for patients with motor and communication impairments.
- Research Article
- 10.54097/wmfh7e61
- Jan 29, 2026
- Academic Journal of Science and Technology
- Jiajun Ge + 2 more
As a critical subfield of speech signal processing, speech intonation recognition technology aims to interpret paralinguistic features (such as pitch, rhythm, and energy) beyond the textual content of an utterance. Its development provides the core driver for enhancing the naturalness and emotional intelligence of human-computer interaction. This study reviews the technology, whose development has progressed from rule-based to statistical models and now to deep learning models, with steadily improving recognition accuracy. Regarding feature extraction, the acquisition of speech signal characteristics such as pitch, duration, and volume provides the data foundation for recognition models. Recognition algorithms have evolved from early Hidden Markov Models (HMMs) and Support Vector Machines (SVMs) to current mainstream deep learning models like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). This technology is widely applied in areas such as voice assistants, intelligent customer service, and speech synthesis. For instance, it enables emotion analysis in voice assistants to adjust service strategies and enhances naturalness in synthesized speech. Current research focuses on developing algorithms robust to noise and accent interference while integrating with cognitive science. Future breakthroughs leveraging deep learning are anticipated in model complexity and recognition accuracy. Furthermore, driven by Internet of Things and 5G technologies, applications are expected to expand into smart homes, telemedicine, and other domains.
- Research Article
- 10.25587/3034-7378-2025-4-56-78
- Jan 17, 2026
- Arctic XXI century
- S P Stepanov + 8 more
Recent breakthroughs in artificial intelligence and deep learning have fundamentally transformed the landscape of spoken language processing technologies. Automatic speech recognition (ASR) and text-to-speech (TTS) synthesis have emerged as essential components driving digital accessibility across diverse linguistic communities. The Sakha language, representing the northeastern branch of the Turkic language family, continues to face substantial technological barriers stemming from insufficient digital resources, limited annotated corpora, and the absence of production-ready speech processing systems. This comprehensive investigation examines the feasibility and effectiveness of adapting contemporary transformer-based neural architectures for bidirectional speech conversion tasks in Sakha. Our research encompasses detailed analysis of encoder-decoder frameworks, specifically OpenAI’s Whisper large-v3 and Meta’s Wav2Vec2-BERT for voice-to-text transformation, alongside Coqui’s XTTS-v2 system for text-to-voice generation. Particular emphasis is placed on addressing linguistic and technical obstacles inherent to Sakha, including its complex agglutinative morphological structure, systematic vowel harmony patterns, and distinctive phonemic inventory featuring sounds absent from most Indo-European languages. Experimental evaluation demonstrates that comprehensive fine-tuning of Whisper-large-v3 achieves exceptional recognition accuracy with word error rate (WER) of 8%, while the self-supervised Wav2Vec2-BERT architecture attains 13% WER when augmented with statistical n-gram language modeling. The neural synthesis system exhibits robust performance despite minimal training data availability, achieving average loss of 2.49 following extended training optimization and practical deployment via Telegram messaging bot. 
Additionally, ensemble meta-stacking combining both recognition architectures achieves 27% WER, demonstrating effective complementarity through learned hypothesis arbitration. These findings validate transfer learning methodologies as viable pathways for developing speech technologies serving digitally underrepresented linguistic communities.
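The recognition results in this abstract are reported as word error rate (WER). As a minimal sketch of the standard definition, WER is the word-level edit distance (substitutions, deletions, insertions) between a hypothesis and a reference transcript, divided by the number of reference words. The function name and the whitespace tokenization are assumptions; evaluation toolkits typically add text normalization before scoring.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words; it is rolled forward row by row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.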
- Research Article
- 10.3390/computation14010020
- Jan 14, 2026
- Computation
- Elsayed Issa
As generative speech synthesis produces near-human synthetic voices and reliance on online media grows, robust audio-deepfake detection is essential to fight misuse and misinformation. In this study, we introduce the Arabic Fake Audio Dataset for Modern Standard Arabic (AFAD-MSA), a curated corpus of authentic and synthetic Arabic speech designed to advance research on Arabic deepfake and spoofed-speech detection. The synthetic subset is generated with four state-of-the-art proprietary text-to-speech and voice-conversion models. Rich metadata—covering speaker attributes and generation information—is provided to support reproducibility and benchmarking. To establish reference performance, we trained three AASIST models and compared their performance to two baseline transformer detectors (Wav2Vec 2.0 and Whisper). On the AFAD-MSA test split, AASIST-2 achieved perfect accuracy, surpassing the baseline models. However, its performance declined under cross-dataset evaluation. These results underscore the importance of data construction. Detectors generalize best when exposed to diverse attack types. In addition, continual or contrastive training that interleaves bona fide speech with large, heterogeneous spoofed corpora will further improve detectors’ robustness.
- Research Article
- 10.22399/ijcesen.4684
- Jan 7, 2026
- International Journal of Computational and Experimental Science and Engineering
- Ganga Gudi + 2 more
Grapheme-to-phoneme (G2P) mapping plays a vital role in the development of text-to-speech systems, particularly for languages with complex morphology and limited computational resources such as Kannada. Existing G2P techniques based on handcrafted rules or supervised machine learning depend heavily on linguistic knowledge or large volumes of labeled data, making them difficult to scale for low-resource languages. To address these challenges, this work introduces a reinforcement learning–driven approach for Kannada grapheme-to-phoneme conversion. The task is modeled as a stepwise decision process in which an intelligent agent incrementally predicts phoneme sequences from written text by learning an optimal policy guided by a reward function that reflects pronunciation correctness and phonological coherence. By learning through interaction rather than direct supervision, the proposed framework adapts effectively to novel word forms and pronunciation variations. Experimental evaluation on a Kannada text dataset shows that the reinforcement learning model produces more accurate phoneme sequences and lower error rates when compared to conventional rule-based and statistical G2P methods. These findings demonstrate the potential of reinforcement learning as a flexible and data-efficient solution for building robust G2P systems in low-resource Indian languages, ultimately enhancing the clarity and naturalness of synthesized Kannada speech.
- Research Article
- 10.3390/electronics15010239
- Jan 5, 2026
- Electronics
- Xiugong Qin + 5 more
Text-to-Speech (TTS) methods typically employ a sequential approach with an Acoustic Model (AM) and a vocoder, using a Mel spectrogram as an intermediate representation. However, in home environments, TTS systems often struggle with issues such as inadequate robustness against environmental noise and limited adaptability to diverse speaker characteristics. The quality of the Mel spectrogram directly affects the performance of TTS systems, yet existing methods overlook the potential of enhancing Mel spectrogram quality through more comprehensive speech features. To address the complex acoustic characteristics of home environments, this paper introduces AirSpeech, a post-processing model for Mel-spectrogram synthesis. We adopt a Generative Adversarial Network (GAN) to improve the accuracy of Mel spectrogram prediction and enhance the expressiveness of synthesized speech. By incorporating additional conditioning extracted from synthesized audio using specified speech feature parameters, our method significantly enhances the expressiveness and emotional adaptability of synthesized speech in home environments. Furthermore, we propose a global normalization strategy to stabilize the GAN training process. Through extensive evaluations, we demonstrate that the proposed method significantly improves the signal quality and naturalness of synthesized speech, providing a more user-friendly speech interaction solution for smart home applications.
- Research Article
- 10.47191/ijmra/v9-i1-02
- Jan 3, 2026
- International Journal of Multidisciplinary Research and Analysis
- Raghdah Adnan Abdulrazzq
Emotion-aware computing is one of the most pressing issues in current human-computer interaction research, and genuinely expressive speech synthesis is a promising area within it that may yield major advances in the near future. In this study, we detail our continuing efforts to construct expressive text-to-speech synthesis systems by creating a data-driven framework for the annotation, modeling, and analysis of expressive speech. We describe the data-driven approach, which includes expression grouping and aural analysis as well as a web-based platform for voice recognition and annotation. Preliminary findings point to promising directions for future study.