Speech Waveform Research Articles

Single-channel, speaker-independent speech separation methods have recently seen great progress. However, the accuracy, latency, and computational cost of such methods remain insufficient. The majority of the previous methods have formulated the separation problem through the time-frequency representation of the mixed signal, which has several drawbacks, including the decoupling of the phase and magnitude of the signal, the suboptimality of time-frequency representation for speech separation, and the long latency of the entire system. To address these shortcomings, we propose a fully-convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation. Conv-TasNet uses a linear encoder to generate a representation of the speech waveform optimized for separating individual speakers. Speaker separation is achieved by applying a set of weighting functions (masks) to the encoder output. The modified encoder representations are then inverted back to the waveforms using a linear decoder. The masks are found using a temporal convolutional network (TCN) consisting of stacked 1-D dilated convolutional blocks, which allows the network to model the long-term dependencies of the speech signal while maintaining a small model size. The proposed Conv-TasNet system significantly outperforms previous time-frequency masking methods in separating two- and three-speaker mixtures. Additionally, Conv-TasNet surpasses several ideal time-frequency magnitude masks in two-speaker speech separation as evaluated by both objective distortion measures and subjective quality assessment by human listeners. Finally, Conv-TasNet has a significantly smaller model size and a much shorter minimum latency, making it a suitable solution for both offline and real-time speech separation applications. This study therefore represents a major step toward the realization of speech separation systems for real-world speech processing technologies.

Read full abstract

Individuals with early-onset severe-profound bilateral hearing loss (S/PHL) manifest diverse levels of benefit and satisfaction with hearing aids (HAs), even with prescriptive HA fitting. Such fittings incorporate normal loudness values, but little is known about aided loudness outcomes in this population and how those outcomes affect benefit or satisfaction. To describe aided loudness growth and satisfaction with aided listening in experienced adult HA users with S/PHL. The Contour Test of loudness perception was administered to listeners with S/PHL in the aided sound field using broadband speech, band-limited speech, and warble tones. Patterns and slopes of resultant loudness growth functions were referenced to sound field results from listeners with normal hearing (NH). S/PHL listeners also rated their aided listening satisfaction. It was expected that (1) most S/PHL listeners would demonstrate steeper than normal aided loudness growth, (2) loudness normalization would be associated with better high-frequency detection thresholds and speech recognition, and (3) closer approximation to normal would yield greater satisfaction. Participants were paid college-student volunteers: 23 with S/PHL, long-term aided listening experience, and new HAs; 15 with NH. Participants rated loudness on four ascending runs per stimulus (5-dB increments) using categories defined in 1997 by Cox and colleagues. The region between the 10th and 90th percentiles of the NH distribution constituted local norms against which location and slope of the S/PHL functions were examined over the range from Quiet to Loud-but-OK. S/PHL functions were categorized on the basis of their configurations (locations/slopes) relative to the norms. Pattern of aided loudness was normalized or within 5 dB of the normal region on 37% of trials with sufficient data for analysis. Only one of the 23 S/PHL listeners did not demonstrate Normal/Near-normal loudness on any trials. Four nonnormal patterns were identified: Steep (recruitment-like; 38% of trials); Shifted right, with normal growth rate (10%); Hypersensitive, with most intensities louder than normal (10%); and Shallow, with decreasing growth rate (7%). Listeners with high-frequency average thresholds above 100 dB hearing loss or no phonemic-based speech-discrimination skill were less likely to display normalized loudness. Slope was within norms for 52% of S/PHL trials, most also having a Normal/Near-normal growth pattern. Regardless of measured loudness results, all but four listeners with S/PHL reported satisfactory hearing almost always or most of the time with their HAs in designated priority need areas. The variety of aided loudness growth patterns identified reflects the diversity known to characterize individuals with early-onset S/PHL. Loudness rating at the validation stage of HA fit with these listeners is likely to reveal nonnormal loudness, signaling need for further HA adjustment. High satisfaction, however, despite nonnormal loudness growth, suggests that listeners with poor auditory speech recognition may benefit more from aided loudness that supports pattern perception (via the time-intensity waveform of speech), different from most current-day prescription fits.

Read full abstract

Speech Waveform Research Articles

Related Topics

Articles published on Speech Waveform

Speech Emotion Classification Using Attention-Based LSTM

ConflictNET: End-to-End Learning for Speech-Based Conflict Intensity Estimation

High Throughput CORDIC Architecture Based 3D-DCT/IDCT Processor

Calibrating rhythms in L1 Japanese and Japanese accented English

Speech enhancement based on simple recurrent unit network

Towards emotion recognition from contextual information using machine learning

Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input

GlotNet—A Raw Waveform Model for the Glottal Excitation in Statistical Parametric Speech Synthesis

Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation.

Modification of Velvet Noise for Speech Waveform Generation by Using Vocoder-Based Speech Synthesizer

Phoneme boundary detection from speech: A rule based approach

Voice Conversion With CycleRNN-Based Spectral Mapping and Finely Tuned WaveNet Vocoder

Vocal Effort Based Speaking Style Conversion Using Vocoder Features and Parallel Learning

Hidden-Markov-model based statistical parametric speech synthesis for Marathi with optimal number of hidden states

HMM speech synthesis based on MDCT representation

A Comparison Between STRAIGHT, Glottal, and Sinusoidal Vocoding in Statistical Parametric Speech Synthesis

Speech Enhancement Using Heterogeneous Information

Mel-Cepstrum-Based Quantization Noise Shaping Applied to Neural-Network-Based Speech Waveform Synthesis

Patterns of Aided Loudness Growth in Experienced Adult Listeners with Early-Onset Severe-Profound Hearing Loss.

Automated depression analysis using convolutional neural networks from speech

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Speech Waveform Research Articles

Related Topics

Articles published on Speech Waveform

Speech Emotion Classification Using Attention-Based LSTM

ConflictNET: End-to-End Learning for Speech-Based Conflict Intensity Estimation

High Throughput CORDIC Architecture Based 3D-DCT/IDCT Processor

Calibrating rhythms in L1 Japanese and Japanese accented English

Speech enhancement based on simple recurrent unit network

Towards emotion recognition from contextual information using machine learning

Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input

GlotNet—A Raw Waveform Model for the Glottal Excitation in Statistical Parametric Speech Synthesis

Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation.

Modification of Velvet Noise for Speech Waveform Generation by Using Vocoder-Based Speech Synthesizer

Phoneme boundary detection from speech: A rule based approach

Voice Conversion With CycleRNN-Based Spectral Mapping and Finely Tuned WaveNet Vocoder

Vocal Effort Based Speaking Style Conversion Using Vocoder Features and Parallel Learning

Hidden-Markov-model based statistical parametric speech synthesis for Marathi with optimal number of hidden states

HMM speech synthesis based on MDCT representation

A Comparison Between STRAIGHT, Glottal, and Sinusoidal Vocoding in Statistical Parametric Speech Synthesis

Speech Enhancement Using Heterogeneous Information

Mel-Cepstrum-Based Quantization Noise Shaping Applied to Neural-Network-Based Speech Waveform Synthesis

Patterns of Aided Loudness Growth in Experienced Adult Listeners with Early-Onset Severe-Profound Hearing Loss.

Automated depression analysis using convolutional neural networks from speech