TIMIT Speech Research Articles

Speech is produced by a nonlinear, dynamical Vocal Tract (VT) system and is transmitted through multiple modes (air, bone and skin conduction), as captured by the air, bone and throat microphones respectively. Speaker-specific characteristics that capture this nonlinearity are rarely used as stand-alone features for speaker modeling, and at best have been used in tandem with well-known linear spectral features to produce tangible results. This paper proposes Recurrence Plot (RP) embeddings as stand-alone, nonlinear speaker-discriminating features. Two datasets, the continuous multimodal TIMIT speech corpus and the consonant-vowel unimodal syllable dataset, are used in this study for closed-set speaker identification experiments. Experiments with unimodal speaker recognition systems show that RP embeddings capture the nonlinear dynamics of the VT system, which are unique to every speaker, in all the modes of speech. The Air (A), Bone (B) and Throat (T) microphone systems, trained purely on RP embeddings, achieve accuracies of 95.81%, 98.18% and 99.74%, respectively. Experiments using the joint feature space of combined RP embeddings for bimodal (A–T, A–B, B–T) and trimodal (A–B–T) systems show that the best trimodal system (99.84% accuracy) performs on par with trimodal systems using spectrogram (99.45%) and MFCC (99.98%) features. The 98.84% accuracy of the B–T bimodal system shows the efficacy of a speaker recognition system based entirely on alternate (bone and throat) speech, in the absence of the standard (air) speech. The results underscore the significance of the RP embedding as a nonlinear feature representation of the dynamical VT system that can act independently for speaker recognition. It is envisaged that speech recognition will also benefit from this nonlinear feature.
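
As a rough illustration of the recurrence plot that the RP features are built on, the sketch below computes a standard binary recurrence matrix from a time-delay embedding of a signal. It is not the paper's pipeline: the abstract does not say how embeddings are derived from the plots, and the function names, the embedding dimension `dim`, the `delay`, the threshold `eps` and the toy sine-wave input are all assumptions chosen for illustration.

```python
import math

def delay_embed(signal, dim, delay):
    """Time-delay embedding: state i is (x[i], x[i+delay], ..., x[i+(dim-1)*delay])."""
    n_states = len(signal) - (dim - 1) * delay
    if n_states <= 0:
        raise ValueError("signal too short for the chosen dim and delay")
    return [tuple(signal[i + k * delay] for k in range(dim)) for i in range(n_states)]

def recurrence_plot(signal, dim=3, delay=2, eps=0.2):
    """Binary matrix R with R[i][j] = 1 when states i and j are within eps (Euclidean)."""
    states = delay_embed(signal, dim, delay)
    return [[1 if math.dist(a, b) <= eps else 0 for b in states] for a in states]

# Toy input (assumption): a sampled sine wave standing in for a short speech segment.
segment = [math.sin(2 * math.pi * t / 20) for t in range(60)]
rp = recurrence_plot(segment)

# Show the top-left corner of the plot; '#' marks a recurrence.
for row in rp[:10]:
    print("".join("#" if cell else "." for cell in row[:30]))
```

For a periodic signal like this one, the plot shows diagonal lines spaced one period apart; less regular dynamics produce more irregular textures, which is the kind of structure a nonlinear feature can exploit.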

Audio copy-move forgery, created by copying one or more segments of an audio file and pasting them at a different position within the same file, is one of the most widely used forgery methods encountered in audio forensics. This type of forgery is easy to apply but difficult to detect when post-processing operations are applied to the forged speech to hide traces of the forgery. This paper proposes a robust method for the detection and localization of audio copy-move forgery using a keypoint-based approach applied to the Mel spectrogram representation of the audio. In the proposed method, the Mel spectrogram image is first created from the input audio. Then, SIFT keypoints are obtained from each RGB color channel of this image. The keypoints obtained from each channel are matched via their feature vectors to reveal clues to the forged regions, and the image sub-blocks centered on the matched keypoints are labeled as forged blocks. The blocks neighboring the forged blocks are then examined to determine whether they are also forged. The proposed post-processing stage completes the determination of the forged regions: it eliminates possible false positives and marks the forged areas in the spectrogram image. The forged segments are then marked in the audio file using the positions of the forged regions in the spectrogram image. Experimental studies are carried out on two pitch-based datasets created using TIMIT and the Arabic Speech Corpus. The paper also presents detailed performance results of widely cited reference studies on these datasets. The results show that the proposed method is more robust against common post-processing operations such as noise addition, filtering, and especially compression.
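
The final step described above, marking forged segments in the audio from their positions in the spectrogram image, can be sketched as a simple column-to-time conversion. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `columns_to_time_span` is hypothetical, the hop length of 160 samples is an assumed value, and spectrogram column k is assumed to start at sample k * hop_length with no centering offset. The 16 kHz default matches the TIMIT sampling rate.

```python
def columns_to_time_span(first_col, last_col, hop_length=160, sample_rate=16000):
    """Map an inclusive range of spectrogram columns to (start, end) in seconds.

    Assumes column k begins at sample k * hop_length (no centering offset).
    """
    if first_col < 0 or last_col < first_col:
        raise ValueError("expected 0 <= first_col <= last_col")
    start = first_col * hop_length / sample_rate
    end = (last_col + 1) * hop_length / sample_rate
    return start, end

# Hypothetical forged region spanning spectrogram columns 10..19.
start_s, end_s = columns_to_time_span(10, 19)
print(f"forged segment: {start_s:.3f} s to {end_s:.3f} s")
```

If the spectrogram was computed with centered frames or a different hop length, the offsets would need adjusting accordingly.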

Articles published on TIMIT Speech

Speech preprocessing and enhancement based on joint time domain and time-frequency domain analysis

Recurrence plot embeddings as short segment nonlinear features for multimodal speaker identification using air, bone and throat microphones

FRMDN: Flow-based Recurrent Mixture Density Network

Spectral degradations in the TIMIT, QuickSIN, NU-6, and other popular bandlimited speech materials

Mel spectrogram-based audio forgery detection using CNN

Detection of audio copy-move-forgery with novel feature matching on Mel spectrogram

Spiking Neural Networks with Improved Inherent Recurrence Dynamics for Sequential Learning

APhL aligner: A neural network forced-alignment system

Real-time pre-processing for improved feature extraction of noisy speech

VOP detection for read and conversation speech using CWT coefficients and phone boundaries

Detection of vowel segments in noise with ImageNet neural network architectures

An Improved Unsupervised Single‐Channel Speech Separation Algorithm for Processing Speech Sensor Signals

The LSTM Neural Network Based on Memristor

Automatic text-independent speaker verification using convolutional deep belief network

High level speaker specific features modeling in automatic speaker recognition system

VEP Detection for Read, Extempore and Conversation Speech

Maximum Feasible Subsystem Algorithms for Recovery of Compressively Sensed Speech

Maximum entropy PLDA for robust speaker recognition under speech coding distortion

Nonlinear waveform distortion: Assessment and detection of clipping on speech data and systems

Identity Vector Extraction by Perceptual Wavelet Packet Entropy and Convolutional Neural Network for Voice Authentication
