Speaker diarization is the task of automatically distinguishing speakers within an audio recording, without any prior information about the speakers. The introduction of the self-attention mechanism in End-to-End Neural Speaker Diarization (EEND) handles overlapping speech directly. The Transformer, built on self-attention, excels at capturing global information and has achieved strong results across many tasks. However, individual speaker characteristics reside largely in local contextual information, which conventional self-attention does not model adequately. In this study, we propose a hierarchical-encoder model that strengthens the encoders' capture of speaker information in two ways: (1) constraining the receptive field of self-attention with left-right windows or Gaussian weights to emphasize local context; (2) employing a pre-trained time-delay neural network (TDNN) based speaker-embedding extractor to compensate for the model's limited speaker-feature extraction ability. We evaluate the proposed methods on a simulated two-speaker dataset and a real conversational dataset. The best-performing variant achieves a diarization error rate of 7.74% on the simulated dataset and 21.92% on MagicData-RAMC after adaptation, demonstrating the effectiveness of the proposed methods.
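The two locality constraints named in (1) can be illustrated with a minimal PyTorch sketch. This is an assumption-laden illustration, not the authors' implementation: the function name and the `window` and `gaussian_sigma` parameters are hypothetical, and the exact form of the paper's windowing and Gaussian weighting may differ.

```python
import math
import torch

def local_attention_scores(q, k, window=None, gaussian_sigma=None):
    """Illustrative sketch (not the paper's code): scaled dot-product
    attention with an optional locality constraint.

    q, k: (batch, time, dim) query/key tensors.
    window: if set, key frames farther than `window` positions from the
            query frame are masked out (left-right window variant).
    gaussian_sigma: if set, scores receive a Gaussian distance penalty
            so nearby frames dominate (Gaussian-weight variant).
    """
    d = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d)  # (B, T, T)

    t = q.size(1)
    pos = torch.arange(t, device=q.device)
    dist = (pos[None, :] - pos[:, None]).abs().float()  # |i - j|, shape (T, T)

    if window is not None:
        # Hard locality: attention restricted to a left-right window.
        scores = scores.masked_fill(dist > window, float("-inf"))
    if gaussian_sigma is not None:
        # Soft locality: quadratic penalty grows with frame distance.
        scores = scores - dist.pow(2) / (2 * gaussian_sigma ** 2)

    return torch.softmax(scores, dim=-1)
```

Either constraint biases each frame's attention toward its temporal neighborhood, which is where the abstract argues speaker-discriminative context resides.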