Target Speaker Research Articles

Voice cues, such as fundamental frequency (F0) and vocal tract length (VTL), help listeners identify the speaker's gender, perceive the linguistic and emotional prosody, and segregate competing talkers. Postlingually implanted adult cochlear implant (CI) users seem to have difficulty in perceiving and making use of voice cues, especially of VTL. Early implanted child CI users, in contrast, perceive and make use of both voice cues better than CI adults, and in patterns similar to their peers with normal hearing (NH).In our study, we investigated the perception and use of voice cues in children with single-sided deafness (SSD) who received their CI at an early age (SSD+CI), in an attempt to bridge the gap between these two groups. The SSD+CI children have access to bilateral auditory information and often receive their CI at an early age, similar to CI children. They may also have dominant acoustic representations, similar to CI adults who acquired hearing loss at a later age. As such, the current study aimed to investigate the perception and use of voice cues by a group of nine early-implanted children with prelingual SSD. The study consisted of three experiments: F0 and VTL discrimination, voice gender categorization, and speech-in-speech perception. In each experiment, the results of the SSD group are compared to children and adults with CIs (for their CI ear) and with typical hearing (for their NH ear).Overall, the SSD+CI children had poorer VTL detection thresholds with their CI compared to their NH ear, while their F0 perception was similar across ears. Detection thresholds for both F0 and VTL with their CI ear was comparable to those of bilaterally implanted CI children, suggesting that SSD+CI children do not only rely on their NH ear, but actually make use of their CI. SSD+CI children relied more heavily on F0 cues than on VTL cues for voice gender categorization, with cue weighting patterns comparable to those of CI adults. In contrast to CI children, the SSD+CI children showed limited speech perception benefit based on F0 and VTL differences between the target and masker speaker, which again corresponded to the results of CI adults. Altogether, the SSD+CI children make good use of their CI, despite a good-hearing ear, however, the perceptual patterns seem to fall in-between those of CI children and CI adults. Perhaps a combination of childhood neuroplasticity, limited experience with relying only on the CI, and a dominant acoustic representation of voice gender explain these results.

Adults heard recordings of two spatially separated speakers reading newspaper and magazine articles. They were asked to listen to one of them and ignore the other, and EEG was recorded to assess their neural processing. Machine learning extracted neural sources that tracked the target and distractor speakers at three levels: the acoustic envelope of speech (delta- and theta-band modulations), lexical frequency for individual words, and the contextual predictability of individual words estimated by GPT-4 and earlier lexical models. To provide a broader view of speech perception, half of the subjects completed a simultaneous visual task, and the listeners included both native and non-native English speakers. Distinct neural components were extracted for these levels of auditory and lexical processing, demonstrating that native English speakers had greater target-distractor separation compared to non-native English speakers on most measures, and that lexical processing was reduced by the visual task. Moreover, there was a novel interaction of lexical predictability and frequency with auditory processing; acoustic tracking was stronger for lexically harder words, suggesting that people listened harder to the acoustics when needed for lexical selection. This demonstrates that speech perception is not simply a feedforward process from acoustic processing to the lexicon. Rather, the adaptable context-sensitive processing long known to occur at a lexical level has broader consequences for perception, coupling with the acoustic tracking of individual speakers in noise.Significance Statement In challenging listening conditions, people use focused attention to help understand individual talkers and ignore others, which changes their neural processing for speech at auditory through lexical levels. However, lexical processing for natural materials (e.g., conversations, audiobooks) has been difficult to measure, because of limitations of tools to estimate the predictability of individual words in longer discourses. The present investigation uses a contemporary large language model, GPT-4, to estimate word predictability, and demonstrates that listeners make online adaptations to their auditory neural processing in accord with these predictions; neural activity more closely tracks the acoustics of the target talker when words are less predictable from the context.

Target Speaker Research Articles

Related Topics

Articles published on Target Speaker

CycleDiffusion: Voice Conversion Using Cycle-Consistent Diffusion Models

Perception of voice cues and speech-in-speech by children with prelingual single-sided deafness and a cochlear implant

Zero-shot voice conversion based on feature disentanglement

Target speaker filtration by mask estimation for source speaker traceability in voice conversion

Neural tracking of speech acoustics in noise is coupled with lexical predictability as estimated by large language models.

DGSD: Dynamical graph self-distillation for EEG-based auditory spatial attention detection

Using deep learning to improve the intelligibility of a target speaker in noisy multi-talker environments for people with normal hearing and hearing loss.

X-TF-GridNet: A time–frequency domain target speaker extraction network with adaptive speaker embedding fusion

Optimizing feature fusion for improved zero-shot adaptation in text-to-speech synthesis

Gated Cross-Attention for Universal Speaker Extraction: Toward Real-World Applications

MRMI-TTS: Multi-Reference Audios and Mutual Information Driven Zero-Shot Voice Cloning

Improving auditory attention decoding by classifying intracranial responses to glimpsed and masked acoustic events

A GRU–CNN model for auditory attention detection using microstate and recurrence quantification analysis

Phoneme Hallucinator: One-Shot Voice Conversion via Set Expansion

Enhancing Zero-Shot Multi-Speaker TTS with Negated Speaker Representations

Auditory processing in neurodiverse children

Attention-based speech feature transfer between speakers.

Noise-robust voice conversion using adversarial training with multi-feature decoupling

AS2T: Arbitrary Source-To-Target Adversarial Attack on Speaker Recognition Systems

Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition.

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Target Speaker Research Articles

Related Topics

Articles published on Target Speaker

CycleDiffusion: Voice Conversion Using Cycle-Consistent Diffusion Models

Perception of voice cues and speech-in-speech by children with prelingual single-sided deafness and a cochlear implant

Zero-shot voice conversion based on feature disentanglement

Target speaker filtration by mask estimation for source speaker traceability in voice conversion

Neural tracking of speech acoustics in noise is coupled with lexical predictability as estimated by large language models.

DGSD: Dynamical graph self-distillation for EEG-based auditory spatial attention detection

Using deep learning to improve the intelligibility of a target speaker in noisy multi-talker environments for people with normal hearing and hearing loss.

X-TF-GridNet: A time–frequency domain target speaker extraction network with adaptive speaker embedding fusion

Optimizing feature fusion for improved zero-shot adaptation in text-to-speech synthesis

Gated Cross-Attention for Universal Speaker Extraction: Toward Real-World Applications

MRMI-TTS: Multi-Reference Audios and Mutual Information Driven Zero-Shot Voice Cloning

Improving auditory attention decoding by classifying intracranial responses to glimpsed and masked acoustic events

A GRU–CNN model for auditory attention detection using microstate and recurrence quantification analysis

Phoneme Hallucinator: One-Shot Voice Conversion via Set Expansion

Enhancing Zero-Shot Multi-Speaker TTS with Negated Speaker Representations

Auditory processing in neurodiverse children

Attention-based speech feature transfer between speakers.

Noise-robust voice conversion using adversarial training with multi-feature decoupling

AS2T: Arbitrary Source-To-Target Adversarial Attack on Speaker Recognition Systems

Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition.