Speaker Identity Information Research Articles

Speech signals are valuable biomarkers for assessing an individual’s mental health, including for automatically identifying Major Depressive Disorder (MDD). A frequently used approach is to employ features related to speaker identity, such as speaker embeddings. However, over-reliance on speaker-identity features in mental health screening systems can compromise patient privacy. Moreover, some aspects of speaker identity may not be relevant for depression detection and could act as a source of bias that hampers system performance. To overcome these limitations, we propose disentangling speaker-identity information from depression-related information. Specifically, we present four distinct disentanglement methods: adversarial speaker identification (SID)-loss maximization (ADV), SID-loss equalization with variance (LEV), SID-loss equalization using cross-entropy (LECE), and SID-loss equalization using KL divergence (LEKLD). Our experiments, which covered diverse input features and model architectures, yielded improved F1-scores for MDD detection and improved voice-privacy attributes, as quantified by Gain in Voice Distinctiveness (GVD) and De-Identification (DeID) scores. On the DAIC-WOZ dataset (English), LECE using ComparE16 features yields the best F1-score of 80%, which represents the audio-only state-of-the-art (SOTA) F1-score for depression detection, along with a GVD of −1.1 dB and a DeID of 85%. On the EATD dataset (Mandarin), ADV using the raw audio signal achieves an F1-score of 72.38%, surpassing the multi-modal SOTA, along with a GVD of −0.89 dB and a DeID of 51.21%. By reducing the dependence on speaker-identity-related features, our method offers a promising direction for speech-based depression detection that preserves patient privacy.
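The abstract names the four disentanglement objectives (ADV, LEV, LECE, LEKLD) but does not define them. The sketch below is only one plausible reading of those names, written in plain Python for a single utterance: ADV negates the speaker-identification (SID) cross-entropy, while the three "equalization" terms push the SID posteriors toward a uniform distribution. Every function name, the exact form of each term, and the `weight` parameter are assumptions made for illustration, not the authors' formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of speaker logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sid_cross_entropy(logits, speaker):
    """Standard SID cross-entropy for one utterance with true speaker index `speaker`."""
    return -math.log(softmax(logits)[speaker])

def adv_term(logits, speaker):
    """ADV (assumed form): maximize the SID loss by minimizing its negative."""
    return -sid_cross_entropy(logits, speaker)

def lev_term(logits):
    """LEV (assumed form): variance of the speaker posteriors; zero when uniform."""
    p = softmax(logits)
    mean = sum(p) / len(p)
    return sum((x - mean) ** 2 for x in p) / len(p)

def lece_term(logits):
    """LECE (assumed form): cross-entropy between a uniform target and the posteriors."""
    p = softmax(logits)
    return -sum(math.log(x) for x in p) / len(p)

def lekld_term(logits):
    """LEKLD (assumed form): KL(uniform || posteriors); zero when uniform."""
    p = softmax(logits)
    u = 1.0 / len(p)
    return sum(u * math.log(u / x) for x in p)

def total_loss(depression_loss, privacy_term, weight=0.1):
    """Combine the depression-detection loss with one privacy term (hypothetical weighting)."""
    return depression_loss + weight * privacy_term
```

Under these assumed definitions, the LECE and LEKLD terms differ only by the constant log K for K speakers, so they share a minimizer (uniform posteriors) and differ only in loss scale; LEV reaches zero at the same point.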


The term Rich Transcription spans multiple areas in audio processing, and its study marks a broadening of the concerns of automatic speech recognition (ASR) to cover the affiliated areas necessary for maximally useful applications. Whereas classical speech recognition focuses purely on converting a sequence of audio words to a sequence of textual words, without regard for capitalization, punctuation, speaker identity, pragmatic intent, and other high-level information, rich transcription attempts to produce a more highly annotated and informative output. The study of rich transcription received a great impetus in 2002 when the Defense Advanced Research Projects Agency (DARPA) started the Effective Affordable Reusable Speech-to-Text (EARS) program. This program extended the previous HUB-4 and HUB-5 programs by adding an emphasis on metadata extraction, in addition to traditional word recognition. The particular metadata tasks that were studied (http://nist.gov/speech/tests/rt/rt2004/fall/docs/rt04f-eval-planv14.doc) are as follows.

• Speaker diarization: the problem of segmenting speech into regions where only one person is talking, and then linking together speech (possibly from disjoint regions of time) from the same speaker.
• Identification of sentence-like units (SUs): the task of segmenting speech into units expressing separate thoughts or ideas, similar to sentences in written language, but taking into account that spoken language might not exhibit complete grammatical sentences.
• Disfluency detection: the dual problems of detecting the speech locations where a fluent word stream is interrupted (interruption point detection), and identifying those words that need to be removed in order to obtain the fluent word sequence of the intended utterance. This involves the labeling of pause fillers (e.g., “uh”), edit words (e.g., “I mean”), and the words that the speaker meant to replace in a self-repair.

Clearly, other forms and definitions of metadata are possible, and the above tasks are offered only for illustrative purposes. While rich transcription adds a new emphasis on various forms of metadata annotation, it also maintains a strong focus on improving automatic speech recognition from a core word-error-rate point of view. This is reflected in the composition of the special issue, with about half the papers addressing ASR. Here, there is a great deal of current interest in topics such as discriminative training, the use of large amounts of training data, unsupervised and semisupervised training, and


Articles published on Speaker Identity Information

Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction

Noise-robust voice conversion using adversarial training with multi-feature decoupling

Enhancing accuracy and privacy in speech-based depression detection through speaker disentanglement

AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Persons

One-shot emotional voice conversion based on feature separation

Speaker recognition based on short utterance compensation method of generative adversarial networks

Closed-set speaker conditioned acoustic-to-articulatory inversion using bi-directional long short term memory network.

Addressing Text-Dependent Speaker Verification Using Singing Speech

Emotion transplantation through adaptation in HMM-based speech synthesis

Cross-speaker generalisation in two phoneme-level perceptual adaptation processes

A Study of Bilinear Models in Voice Conversion

Introduction to the Special Section on Rich Transcription

