Speech Recognition Research Articles

Speech-based automatic depression detection systems have been extensively explored over the past few years. Typically, each speaker is assigned a single label (Depressive or Non-depressive), and most approaches formulate depression detection as a speech classification task without explicitly considering the non-uniformly distributed depression pattern within segments, which leads to low generalizability and robustness across different scenarios. Moreover, depression corpora do not provide fine-grained labels (at the phoneme or word level), which makes the dynamic depression pattern in speech segments harder to track using conventional frameworks. To address this, we propose a novel framework, Speechformer-CTC, which models non-uniformly distributed depression characteristics within segments using a Connectionist Temporal Classification (CTC) objective function, without requiring input–output alignment. Two novel CTC-label generation policies, the Expectation-One-Hot policy and the HuBERT policy, are proposed and incorporated into objectives at various granularities. Additionally, experiments using Automatic Speech Recognition (ASR) features are conducted to demonstrate the compatibility of the proposed method with content-based features. Our results show that depression detection performance, in terms of Macro F1-score, improves on both the DAIC-WOZ (English) and CONVERGE (Mandarin) datasets. On the DAIC-WOZ dataset, the system with HuBERT ASR features and a CTC objective optimized using the HuBERT policy for label generation achieves an F1-score of 83.15%, which is close to the state of the art without the need for phoneme-level transcription or data augmentation. On the CONVERGE dataset, using Whisper features with the HuBERT policy improves the F1-score by 9.82% on CONVERGE1 (in-domain test set) and 18.47% on CONVERGE2 (out-of-domain test set). These findings show that depression detection can benefit from modeling non-uniformly distributed depression patterns, and that the proposed framework could potentially be used to identify significant depressive regions in speech utterances.
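The alignment-free property of the CTC objective mentioned above can be illustrated with the standard CTC forward algorithm, which sums the probability of every frame-level alignment that collapses to a given label sequence. The following is a minimal sketch of that generic algorithm only, not the Speechformer-CTC implementation; the function name, the three-label vocabulary (0 = blank, 1 = non-depressive, 2 = depressive), and the frame probabilities are hypothetical and chosen purely for illustration.

```python
import math

def ctc_neg_log_likelihood(probs, target, blank=0):
    """Standard CTC forward algorithm (illustrative, pure Python).

    probs:  list of T frames, each a list of per-label probabilities.
    target: label sequence without blanks.
    Returns -log P(target | probs), summed over all valid alignments.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    size = len(ext)

    # alpha[s]: total probability of partial alignments ending at ext[s]
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]                  # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]         # advance by one symbol
            # Skipping a blank is allowed only between two different labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # A valid alignment ends on the last label or on the trailing blank
    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood)

# Hypothetical posteriors for 4 frames over (blank, non-depressive, depressive)
frames = [
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
    [0.1, 0.1, 0.8],
    [0.7, 0.2, 0.1],
]
loss = ctc_neg_log_likelihood(frames, target=[2])
print(f"CTC loss for target [2]: {loss:.4f}")
```

No frame is ever told which label it should carry: the loss only requires that the frames, taken together, collapse to the target sequence. This sketch works with plain probabilities and raises `ValueError` if no alignment has non-zero probability; practical implementations work in log space for numerical stability.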

Reverberation is one of the most critical obstacles to adopting automatic speech recognition (ASR) in real-life environments. A comprehensive understanding of the effect of reverberation on ASR is therefore required to design robust ASR systems for practical use. To deepen this understanding, we performed a phonemic analysis on a commercial ASR system. The analysis uses a new metric, mean phoneme coherence (MPC), defined as the time–frequency-averaged coherence function between the clean and reverberated speech spectrograms of each phoneme. MPC measures the amount of spectral contamination of a phoneme under a given reverberation condition, and thus quantifies not only the amount of reverberation on the phoneme but also the contextual influences on the phoneme within a sentence spoken in that condition. Comparing MPC with word error rate (WER) in real reverberation conditions showed that MPC represents both the amount of reverberation and the intelligibility of speech under a given reverberation condition. Furthermore, the relationship between phoneme groups’ vulnerability to spectral contamination and ASR performance under reverberation is analyzed by comparing the median of each phoneme group’s MPC distribution with phoneme group word accuracy (PGWA). The analysis shows that the two quantities are only weakly correlated, indicating that reverberation affects the intelligibility of phonemes differently. In addition, a comparative study among phoneme groups shows that nasals and semivowels yield the least robust ASR performance under reverberation, while nasals and stops are the most vulnerable to spectral contamination. The results and discussion identify what should be taken into account when designing ASR that is robust to reverberation.
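The abstract defines MPC only as a time–frequency-averaged coherence between clean and reverberated spectrograms and does not specify the estimator. As a rough illustration of what such a quantity could look like, the sketch below assumes the usual magnitude-squared coherence, estimated across the frames of one phoneme segment for each frequency bin and then averaged over bins. The function name, the `eps` regularizer, and the toy spectrograms are hypothetical, and the paper's actual definition may differ.

```python
def mean_coherence(clean, reverb, eps=1e-12):
    """Frequency-averaged magnitude-squared coherence of two spectrograms.

    clean, reverb: lists of frames; each frame is a list of complex STFT bins.
    Both must have the same shape (same phoneme segment, same STFT settings).
    Assumption: coherence is estimated across frames, then averaged over bins.
    """
    n_bins = len(clean[0])
    total = 0.0
    for f in range(n_bins):
        cross = 0j        # cross-spectrum, summed over frames
        p_clean = 0.0     # auto-spectrum of the clean signal
        p_reverb = 0.0    # auto-spectrum of the reverberated signal
        for x_frame, y_frame in zip(clean, reverb):
            x, y = x_frame[f], y_frame[f]
            cross += x * y.conjugate()
            p_clean += abs(x) ** 2
            p_reverb += abs(y) ** 2
        # eps guards against division by zero in silent bins
        total += abs(cross) ** 2 / (p_clean * p_reverb + eps)
    return total / n_bins

# Toy 2-frame, 2-bin complex spectrograms (made-up values)
clean_spec = [[1 + 0j, 2j], [0.5 + 0.5j, 1 + 0j]]
reverb_spec = [[0.8 + 0.1j, 1.5j], [0.9 - 0.2j, 0.3 + 0.4j]]
print(mean_coherence(clean_spec, clean_spec))   # identical input: close to 1
print(mean_coherence(clean_spec, reverb_spec))  # contaminated input: lower
```

Under this assumed definition, values near 1 indicate that the reverberated phoneme is spectrally almost a scaled copy of the clean one, and lower values indicate stronger spectral contamination.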


Articles published on Speech Recognition

Enhanced multi-ethnic speech recognition using pitch shifting generative adversarial networks

Speechformer-CTC: Sequential modeling of depression detection with speech temporal classification

Relational-branchformer: Novel framework for audio-visual speech recognition

Deep Learning for Visual Speech Analysis: A Survey

Enhancing the English natural language processing dictionary using natural language processing++

Musician Advantage for Segregation of Competing Speech in Native Tonal Language Speakers

Investigating the Effects of Artificial Intelligence-Assisted Language Learning Strategies on Cognitive Load and Learning Outcomes: A Comparative Study

Using AI-Powered Speech Recognition Technology to Improve English Pronunciation and Speaking Skills

Comprehensive multiparametric analysis of human deepfake speech recognition

A blended framework for audio spoof detection with sequential models and bags of auditory bites

Benefits of Cochlear Implantation for Older Adults With Asymmetric Hearing Loss

A Study on the Impact of Voice-to-Text Technology on Academic Achievement of the Hearing-Impaired

Recurrent Neural Networks: A Comprehensive Review of Architectures, Variants, and Applications

The relationship and interdependence of auditory thresholds, proposed behavioural measures of hidden hearing loss, and physiological measures of auditory function

Integration of Technology in Arabic Language Teaching in Writing and Speaking Skills

Significance of chirp MFCC as a feature in speech and audio applications

Automatic speech recognition (ASR) for the diagnosis of pronunciation of speech sound disorders in Korean children

Coherence-based phonemic analysis on the effect of reverberation to practical automatic speech recognition

Biobased, degradable and directional porous carboxymethyl chitosan/lignosulfonate sodium aerogel-based piezoresistive pressure sensor with dual-conductive network for human motion detection

Helicopter cockpit speech recognition method based on transfer learning and context biasing

