AirSpeech: Lightweight Speech Synthesis Framework for Home Intelligent Space Service Robots

Abstract

Text-to-Speech (TTS) methods typically employ a sequential approach with an Acoustic Model (AM) and a vocoder, using a Mel spectrogram as an intermediate representation. However, in home environments, TTS systems often struggle with issues such as inadequate robustness against environmental noise and limited adaptability to diverse speaker characteristics. The quality of the Mel spectrogram directly affects the performance of TTS systems, yet existing methods overlook the potential of enhancing Mel spectrogram quality through more comprehensive speech features. To address the complex acoustic characteristics of home environments, this paper introduces AirSpeech, a post-processing model for Mel-spectrogram synthesis. We adopt a Generative Adversarial Network (GAN) to improve the accuracy of Mel spectrogram prediction and enhance the expressiveness of synthesized speech. By incorporating additional conditioning extracted from synthesized audio using specified speech feature parameters, our method significantly enhances the expressiveness and emotional adaptability of synthesized speech in home environments. Furthermore, we propose a global normalization strategy to stabilize the GAN training process. Through extensive evaluations, we demonstrate that the proposed method significantly improves the signal quality and naturalness of synthesized speech, providing a more user-friendly speech interaction solution for smart home applications.
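The abstract does not define its global normalization strategy. As a rough illustration only, one common reading is to compute a single mean and standard deviation over every Mel bin of the whole training set, rather than per utterance, so the GAN always sees inputs on one fixed scale. The sketch below assumes that reading; the function names and the nested-list layout (utterance, frame, Mel bin) are hypothetical and not taken from the paper.

```python
import math

def global_stats(mels):
    """Mean and standard deviation over every bin of every utterance.

    `mels` is a list of utterances; each utterance is a list of frames;
    each frame is a list of Mel-bin values (assumed layout).
    """
    values = [v for mel in mels for frame in mel for v in frame]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

def normalize(mel, mean, std, eps=1e-8):
    """Scale one utterance with the dataset-wide statistics."""
    return [[(v - mean) / (std + eps) for v in frame] for frame in mel]

def denormalize(mel, mean, std, eps=1e-8):
    """Invert `normalize`, e.g. before passing the output to a vocoder."""
    return [[v * (std + eps) + mean for v in frame] for frame in mel]
```

Because the statistics are shared across utterances, the same `mean` and `std` can be stored once and reused at inference time.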

Similar Papers
  • Research Article
  • Cite Count: 17
  • 10.3390/app13010569
Analyzing Noise Robustness of Cochleogram and Mel Spectrogram Features in Deep Learning Based Speaker Recognition
  • Dec 31, 2022
  • Applied Sciences
  • Wondimu Lambamo + 2 more

Speaker recognition systems perform very well on datasets without noise or mismatch. However, performance degrades with environmental noise, channel variation, and physical and behavioral changes in the speaker. The type of speaker-related feature plays a crucial role in improving the performance of speaker recognition systems. Gammatone Frequency Cepstral Coefficient (GFCC) features have been widely used to develop robust speaker recognition systems with conventional machine learning, where they achieved better performance than Mel Frequency Cepstral Coefficient (MFCC) features in noisy conditions. Recently, deep learning models have shown better performance in speaker recognition than conventional machine learning. Most previous deep learning-based speaker recognition models have used Mel Spectrogram and similar inputs rather than handcrafted features such as MFCC and GFCC. However, the performance of Mel Spectrogram features degrades at high noise ratios and with mismatch in the utterances. Similar to Mel Spectrogram, Cochleogram is another important feature for deep learning speaker recognition models. Like GFCC features, Cochleogram represents utterances on the Equivalent Rectangular Bandwidth (ERB) scale, which is important in noisy conditions. However, no study has analyzed the noise robustness of Cochleogram and Mel Spectrogram in speaker recognition. In addition, only a limited number of studies have used Cochleogram to develop speech-based models in noisy and mismatch conditions using deep learning. In this study, the noise robustness of Cochleogram and Mel Spectrogram features in speaker recognition using deep learning models is analyzed at Signal to Noise Ratio (SNR) levels from −5 dB to 20 dB. Experiments are conducted on the VoxCeleb1 and noise-added VoxCeleb1 datasets using basic 2DCNN, ResNet-50, VGG-16, ECAPA-TDNN, and TitaNet model architectures. The speaker identification and verification performance of both Cochleogram and Mel Spectrogram is evaluated. The results show that Cochleogram performs better than Mel Spectrogram in both speaker identification and verification in noisy and mismatch conditions.
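As a rough illustration of what evaluating at a given SNR involves, the sketch below scales a noise signal so that the speech-to-noise power ratio hits a target value in dB and then adds it to the speech. It is a generic recipe, not the authors' data pipeline; the function names and the plain-list signal representation are assumptions made for the example.

```python
import math
import random

def power(signal):
    """Average power of a signal given as a list of samples."""
    return sum(v * v for v in signal) / len(signal)

def snr_db(speech, noise):
    """Signal-to-noise ratio in dB: 10 * log10(P_speech / P_noise)."""
    return 10.0 * math.log10(power(speech) / power(noise))

def mix_at_snr(speech, noise, target_snr_db):
    """Scale `noise` so the mixture has the requested SNR, then add it."""
    target_noise_power = power(speech) / (10.0 ** (target_snr_db / 10.0))
    scale = math.sqrt(target_noise_power / power(noise))
    return [s + scale * n for s, n in zip(speech, noise)]

# Hypothetical signals: a sine tone as "speech" and Gaussian noise.
rng = random.Random(0)
speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]
noisy = mix_at_snr(speech, noise, target_snr_db=-5.0)
```

Sweeping `target_snr_db` over the range named in the abstract (−5 dB to 20 dB) would produce one noisy copy of the data per level.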

  • Research Article
  • 10.3390/app14146336
Using Transfer Learning to Realize Low Resource Dungan Language Speech Synthesis
  • Jul 20, 2024
  • Applied Sciences
  • Mengrui Liu + 2 more

This article presents a transfer-learning-based method to improve the synthesized speech quality of the low-resource Dungan language. This improvement is accomplished by fine-tuning a pre-trained Mandarin acoustic model to a Dungan language acoustic model using a limited Dungan corpus within the Tacotron2+WaveRNN framework. Our method begins with developing a transformer-based Dungan text analyzer capable of generating unit sequences with embedded prosodic information from Dungan sentences. These unit sequences, along with the speech features, provide <unit sequence with prosodic labels, Mel spectrograms> pairs as the input of Tacotron2 to train the acoustic model. Concurrently, we pre-trained a Tacotron2-based Mandarin acoustic model using a large-scale Mandarin corpus. The model is then fine-tuned with a small-scale Dungan speech corpus to derive a Dungan acoustic model that autonomously learns the alignment and mapping of the units to the spectrograms. The resulting spectrograms are converted into waveforms via the WaveRNN vocoder, facilitating the synthesis of high-quality Mandarin or Dungan speech. Both subjective and objective experiments suggest that the proposed transfer learning-based Dungan speech synthesis achieves superior scores compared to models trained only with the Dungan corpus and other methods. Consequently, our method offers a strategy to achieve speech synthesis for low-resource languages by adding prosodic information and leveraging a similar, high-resource language corpus through transfer learning.

  • Conference Article
  • Cite Count: 2
  • 10.1117/12.2659719
Research on synthesis of designated speaker speech based on StarGAN-VC model
  • Nov 30, 2022
  • Xiaohong Qiu + 1 more

With the rapid development of deep learning, the research focus of speech synthesis has gradually shifted to artificial neural network technology. Speech quality has greatly improved, and neural speech synthesis has been introduced into many application scenarios. However, existing synthesis systems require rich, high-quality parallel datasets for training, and the synthesized speech is weakly personalized. This paper describes an improved Mel spectrogram acoustic feature sequence prediction model based on Tacotron2 and a StarGAN-VC model. The model uses the predicted Mel spectrogram as input to generate the Mel spectrogram sequence of the specified speaker and synthesize speech. The StarGAN-VC model can be trained on a small non-parallel dataset, generate the Mel spectrogram sequence of the designated speaker in real time, and synthesize speech, which can effectively address the problem of the lack of non-parallel datasets and enrich the speech content generated by the StarGAN-VC model. The experimental results show that the StarGAN-VC model can generate relatively smooth Mel spectrograms from the Mel spectrogram sequence predicted by the improved model and is more expressive in handling Mel spectrogram details, so that it synthesizes smooth and highly intelligible speech. The model is trained on about 27 minutes of speech data from the designated speaker to synthesize personalized speech, which provides an effective reference for personalized speech synthesis.

  • Conference Article
  • Cite Count: 9
  • 10.1109/icce.2019.8661919
Emotional Speech Synthesis for Multi-Speaker Emotional Dataset Using WaveNet Vocoder
  • Jan 1, 2019
  • Heejin Choi + 3 more

This paper studies methods for emotional speech synthesis using a neural vocoder. WaveNet, which generates waveforms from mel spectrograms, is used as the neural vocoder. We propose two networks, i.e., a deep convolutional neural network (CNN)-based text-to-speech (TTS) system and an emotional converter, and the deep CNN architecture is designed to utilize long-term context information. The first network estimates neutral mel spectrograms using linguistic features, and the second network converts neutral mel spectrograms to emotional mel spectrograms. Experimental results on a TTS system and an emotional TTS system showed that the proposed systems are indeed a promising approach.

  • Conference Article
  • 10.2991/iceeim-14.2014.43
Dynamic Speech Feature Parameter Extraction Based on Fitting
  • Jan 1, 2014
  • Yingjie Meng + 3 more

In existing research on speech feature parameter recognition, noise robustness is poor and storage requirements are large. Data fitting is therefore introduced into speech feature parameter extraction to address these problems. Combining the dynamic changes of the speech spectrum with the short-time energy stationarity of the speech signal, this paper proposes and designs an algorithm for dynamic speech feature parameter extraction based on fitting, and constructs a feature parameter extraction and personal identification scheme. It also designs the algorithms of the critical modules. The feature parameter extraction proceeds as follows: first, a 2-D coordinate system is created for each frame of data. Then, fitting is performed in the 2-D coordinate system so that the fitting function matches the original data as closely as possible, yielding the best fitting order for each frame. Finally, the feature parameters, combined with the fitting order, are extracted from each frame. The algorithm was simulated experimentally to confirm its applicability and feasibility. The results show that the method has better anti-noise performance, with especially clear advantages in the representation and storage of speech segment feature parameters. Index Terms - speech recognition, feature parameter, extraction method

  • Conference Article
  • Cite Count: 9
  • 10.1109/icassp48485.2024.10446830
SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis
  • Apr 14, 2024
  • Teysir Baoueb + 4 more

Generative adversarial network (GAN) models can synthesize high-quality audio signals while ensuring fast sample generation. However, they are difficult to train and are prone to several issues including mode collapse and divergence. In this paper, we introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN, which was initially devised for speech synthesis from mel spectrogram. In our model, the training stability is enhanced by means of a forward diffusion process which consists in injecting noise from a Gaussian distribution to both real and fake samples before inputting them to the discriminator. We further improve the model by exploiting a spectrally-shaped noise distribution with the aim to make the discriminator's task more challenging. We then show the merits of our proposed model for speech and music synthesis on several datasets. Our experiments confirm that our model compares favorably in audio quality and efficiency compared to several baselines.
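To make the noise-injection idea concrete, the sketch below applies a standard DDPM-style forward diffusion step, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε with ε drawn from a standard Gaussian, to a sample before it would reach a discriminator. This is a generic formulation under assumed settings: the linear beta schedule, its endpoint values, and all function names are illustrative choices, and the spectrally-shaped noise described in the abstract is not modeled here.

```python
import math
import random

def alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    bars, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        bars.append(prod)
    return bars

def diffuse(x0, alpha_bar, rng):
    """One forward diffusion draw: mix the sample with white Gaussian noise."""
    keep = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [keep * v + noise_scale * rng.gauss(0.0, 1.0) for v in x0]

# The same step t is applied to the real and the generated sample,
# so the discriminator compares them at an equal noise level.
rng = random.Random(0)
bars = alpha_bars(num_steps=50)
t = 10
real_noisy = diffuse([0.2, -0.1, 0.4], bars[t], rng)
fake_noisy = diffuse([0.1, 0.0, 0.3], bars[t], rng)
```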

  • Research Article
  • Cite Count: 4
  • 10.3389/fresc.2023.1121034
Estimation of subjective quality of life in schizophrenic patients using speech features.
  • Mar 10, 2023
  • Frontiers in Rehabilitation Sciences
  • Yuko Shibata + 5 more

Patients with schizophrenia experience the most prolonged hospital stay in Japan. Also, the high re-hospitalization rate affects their quality of life (QoL). Despite being an effective predictor of treatment, QoL has not been widely utilized due to time constraints and lack of interest. As such, this study aimed to estimate the schizophrenic patients' subjective quality of life using speech features. Specifically, this study uses speech from patients with schizophrenia to estimate the subscale scores, which measure the subjective QoL of the patients. The objectives were to (1) estimate the subscale scores from different patients or cross-sectional measurements, and (2) estimate the subscale scores from the same patient in different periods or longitudinal measurements. A conversational agent was built to record the responses of 18 schizophrenic patients on the Japanese Schizophrenia Quality of Life Scale (JSQLS) with three subscales: "Psychosocial," "Motivation and Energy," and "Symptoms and Side-effects." These three subscales were used as objective variables. On the other hand, the speech features during measurement (Chromagram, Mel spectrogram, Mel-Frequency Cepstrum Coefficient) were used as explanatory variables. For the first objective, a trained model estimated the subscale scores for the 18 subjects using the Nested Cross-validation (CV) method. For the second objective, six of the 18 subjects were measured twice. Then, another trained model estimated the subscale scores for the second time using the 18 subjects' data as training data. Ten different machine learning algorithms were used in this study, and the errors of the learned models were compared. The results showed that the mean RMSE of the cross-sectional measurement was 13.433, with k-Nearest Neighbors as the best model. Meanwhile, the mean RMSE of the longitudinal measurement was 13.301, using Random Forest as the best. RMSE of less than 10 suggests that the estimated subscale scores using speech features were close to the actual JSQLS subscale scores. Ten out of 18 subjects were estimated with an RMSE of less than 10 for cross-sectional measurement. Meanwhile, five out of six had the same observation for longitudinal measurement. Future studies using a larger number of subjects and the development of more personalized models based on longitudinal measurements are needed to apply the results to telemedicine for continuous monitoring of QoL.
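For readers unfamiliar with the metric, the sketch below shows how a root-mean-square error between actual and estimated subscale scores would be computed, together with the "less than 10" check mentioned in the abstract. The scores in the example are invented for illustration and are not data from the study; the function name is likewise hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length score lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up subscale scores for one hypothetical subject.
actual_scores = [50.0, 60.0, 70.0]
estimated_scores = [53.0, 56.0, 70.0]
error = rmse(actual_scores, estimated_scores)
close_enough = error < 10.0  # threshold used in the abstract
```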

  • Research Article
  • Cite Count: 20
  • 10.1016/j.specom.2006.06.008
A feature extraction method using subband based periodicity and aperiodicity decomposition with noise robust frontend processing for automatic speech recognition
  • Jul 21, 2006
  • Speech Communication
  • Kentaro Ishizuka + 1 more


  • Research Article
  • 10.24425/aee.2026.156800
Transformer fault diagnosis method based on multilevel acoustic information
  • Jan 12, 2026
  • Archives of Electrical Engineering
  • Xuan Li + 6 more

To more accurately obtain the feature information embedded in the acoustic pattern of transformers, a transformer fault diagnosis method is proposed based on multilevel acoustic information of 14 state types. In this method, a parallel dual-channel fault diagnosis model, CNN-BiLSTM-Transformer, is established. First, the modified Mel inversion coefficients and Mel spectrograms are extracted from the original acoustic pattern data. The modified Mel inversion coefficients and Mel spectrograms are then input into the parallel dual-channel model. In the first channel, a convolutional neural network model is used to extract the feature information of the maps. In the second channel, a bidirectional long short-term memory network and a Transformer encoder are used to partially extract the temporal features in the MFCCs. Finally, the features extracted from the two channels are fused through multimodal fusion for training. The experimental results show that the proposed diagnostic method can achieve an average accuracy of 99.5% in multiple fault diagnosis. Compared with current mainstream acoustic single-channel diagnostic models, the diagnostic rate of this model is improved by an average of 4.8%, exhibiting higher accuracy and robustness.

  • Front Matter
  • Cite Count: 4
  • 10.1016/j.jpeds.2011.04.003
Home Environment, Asthma, and Obesity: How Are They Related?
  • May 18, 2011
  • The Journal of Pediatrics
  • Donna R Halloran


  • Research Article
  • Cite Count: 4
  • 10.1371/journal.pone.0319027
An improved ViT model for music genre classification based on mel spectrogram
  • Mar 13, 2025
  • PLOS One
  • Pingping Wu + 6 more

Automating the task of music genre classification offers opportunities to enhance user experiences, streamline music management processes, and unlock insights into the rich and diverse world of music. In this paper, an improved ViT model is proposed to extract more comprehensive music genre features from Mel spectrograms by leveraging the strengths of both convolutional neural networks and Transformers. Also, the paper incorporates a channel attention mechanism by amplifying differences between channels within the Mel spectrograms of individual music genres, thereby facilitating more precise classification. Experimental results on the GTZAN dataset show that the proposed model achieves an accuracy of 86.8%, paving the way for more accurate and efficient music genre classification methods compared to earlier approaches.

  • Research Article
  • Cite Count: 2
  • 10.1088/1742-6596/1631/1/012039
Music Style Transfer with Vocals Based on CycleGAN
  • Sep 1, 2020
  • Journal of Physics: Conference Series
  • Hongliang Ye + 1 more

In recent years, with the development of generative adversarial networks (GAN), their applications have gradually matured. An important application area for generative adversarial networks is neural style transfer. Neural style transfer has played a major role in image applications but has performed poorly in the music field. In addition, existing music style transfer algorithms perform poorly on music with vocals. Therefore, this paper extracts the CQT features and Mel spectrogram features of music, uses CycleGAN to transfer the styles of the CQT and Mel spectrogram feature images, and finally realizes the style transfer of music. On the classifier we trained, the average style transfer rate of music that meets our requirements reached 94.07%.

  • Research Article
  • Cite Count: 7
  • 10.3390/app12136595
An Acoustic Feature-Based Deep Learning Model for Automatic Thai Vowel Pronunciation Recognition
  • Jun 29, 2022
  • Applied Sciences
  • Niyada Rukwong + 1 more

In Thai, mispronouncing a vowel can change the meaning of a word completely. Thus, effective and standardized practice is essential to pronouncing words correctly, as a native speaker would. Since the COVID-19 pandemic, online learning has become increasingly popular. For example, an online pronunciation application system was introduced that has virtual teachers and an intelligent process of evaluating students that is similar to standardized training by a teacher in a real classroom. This research presents an online automatic computer-assisted pronunciation training (CAPT) system that uses deep learning to recognize Thai vowels in speech. The automatic CAPT is developed to address the shortage of instruction specialists and the complexity of the vowel teaching process. It is a unique system that integrates computer techniques with linguistic theory. The deep learning model that recognizes the pronounced vowels is the most significant part of the automatic CAPT. The major challenge in Thai vowel recognition is the correct identification of Thai vowels when spoken in real-world situations. A convolutional neural network (CNN), a deep learning model, is applied and developed for the classification of pronounced Thai vowels. A new dataset for Thai vowels was designed, collected, and examined by linguists. The optimal CNN model with Mel spectrogram (MS) input achieves the highest accuracy, 98.61%, whereas the baseline long short-term memory (LSTM) model achieves 94.44% with Mel frequency cepstral coefficients (MFCC) and 90.00% with MS.

  • Conference Article
  • Cite Count: 1
  • 10.1109/cis54983.2021.00032
Adversarial Training with Gated Convolutional Neural Networks for Robust Speech Recognition
  • Nov 1, 2021
  • Xudong Lv + 2 more

In natural environments, the performance of automatic speech recognition systems is often affected by environmental noise. The noise data augmentation method is commonly used to boost acoustic models’ robustness; however, audios with background noise may degrade the acoustic model's performance in clean audios. In this paper, we propose an approach of adversarial training with gated convolutional neural networks for robust speech recognition. We use generative adversarial networks and gated convolutional neural networks to allow the acoustic model to learn noise-invariant information. Specifically, we choose the first several layers of the acoustic model as the generator model. Systematic experiments on aishell-1 show that adversarial training with gated convolutional neural networks boosts the robustness of the acoustic model in noisy environments and improves the performance of the acoustic model in quiet environments. Compared with the simple noise data augmentation training method, adversarial training with gated convolutional neural networks reduces the average relative error rate by 4.4% on the clean test data and 5.6% on the noisy test data.

  • Research Article
  • 10.3390/app15031337
Age Prediction from Korean Speech Data Using Neural Networks with Diverse Voice Features
  • Jan 27, 2025
  • Applied Sciences
  • Hayeon Ku + 4 more

A person’s voice serves as an indicator of age, as it changes with anatomical and physiological influences throughout their life. Although age prediction is a subject of interest across various disciplines, age-prediction studies using Korean voices are limited. The few studies that have been conducted have limitations, such as the absence of specific age groups or detailed age categories. Therefore, this study proposes an optimal combination of speech features and deep-learning models to recognize detailed age groups using a large Korean-speech dataset. From the speech dataset, recorded by individuals ranging from their teens to their 50s, four speech features were extracted: the Mel spectrogram, log-Mel spectrogram, Mel-frequency cepstral coefficients (MFCCs), and ΔMFCCs. Using these speech features, four deep-learning models were trained: ResNet-50, 1D-CNN, 2D-CNN, and a vision transformer. A performance comparison of speech feature-extraction methods and models indicated that MFCCs + ΔMFCCs was the best for both sexes when trained on the 1D-CNN model; it achieved an accuracy of 88.16% for males and 81.95% for females. The results of this study are expected to contribute to the future development of Korean speaker-recognition systems.
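The ΔMFCC features mentioned above are first-order time derivatives of the MFCC sequence. A minimal sketch of the widely used regression formula, d_t = Σ_{n=1..N} n·(c_{t+n} − c_{t−n}) / (2·Σ_{n=1..N} n²), is given below. The window width N = 2 and the edge handling (repeating the first and last frame) are assumptions; the abstract does not state the settings the authors used.

```python
def delta(frames, width=2):
    """First-order delta features for a sequence of coefficient vectors.

    `frames` is a list of frames, each a list of coefficients.
    Frames beyond either end are replaced by the nearest edge frame.
    """
    denom = 2 * sum(n * n for n in range(1, width + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                ahead = frames[min(t + n, last)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out

def stack_with_delta(frames, width=2):
    """Concatenate each frame with its delta, as in an MFCC + delta-MFCC input."""
    return [f + d for f, d in zip(frames, delta(frames, width))]
```

For a coefficient that rises by one unit per frame, the interior deltas come out as 1, which is a quick sanity check on the formula.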
