VoxCeleb1 Dataset Research Articles

Image animation aims to transfer the pose changes in a driving video to a static object in a source image, and has potential applications in domains such as the film and game industries. The essential part of this task is to generate a video by learning the motion from the driving video while preserving the appearance of the source image. As a result, the animated video shows the source object performing the same motion. However, this is a significant challenge when the object pose changes on a large scale. Even the most recent methods fail to handle this case with good visual quality. To address the poor visual quality of videos with large-scale pose changes, this paper proposes a novel method based on an improved first-order motion model (FOMM) with enhanced dense motion and repair ability. First, when generating optical flow, we propose an attention mechanism that optimizes the feature representation of the image in both the channel and spatial domains through max pooling. This enables the source image to be warped more accurately into the feature domain of the driving image. Second, we propose a multi-scale occlusion restoration module that generates a multi-resolution occlusion map by upsampling the low-resolution occlusion map. The generator then uses the multi-resolution occlusion map to redraw the occluded parts of the reconstruction result at multiple scales, which yields more accurate and vivid visual effects. In addition, the proposed model can be trained effectively in an unsupervised manner. We evaluated the proposed model on three benchmark datasets. The experimental results showed that the proposed method improved multiple evaluation metrics, and the visual quality of the animated videos clearly exceeded that of the FOMM. On the VoxCeleb1 dataset, the proposed method reduced the pixel error, average keypoint distance and average Euclidean distance by 6.5%, 5.1% and 0.7%, respectively. On the TaiChiHD dataset, it reduced the pixel error, average keypoint distance and missing keypoint rate by 4.9%, 13.5% and 25.8%, respectively.
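The multi-resolution occlusion map described in the abstract can be illustrated with a minimal sketch. The abstract does not state the interpolation method or the scale factors, so nearest-neighbour upsampling and the factors 1, 2 and 4 are assumptions here, and the function names `upsample_nearest` and `occlusion_pyramid` are hypothetical, not taken from the paper.

```python
def upsample_nearest(occlusion_map, factor):
    """Upsample a 2D map (list of rows) by an integer factor.

    Nearest-neighbour interpolation is an assumption; the abstract
    only says the low-resolution map is upsampled.
    """
    upsampled = []
    for row in occlusion_map:
        # Repeat every value `factor` times along the width.
        wide_row = [value for value in row for _ in range(factor)]
        # Repeat the widened row `factor` times along the height.
        upsampled.extend(list(wide_row) for _ in range(factor))
    return upsampled


def occlusion_pyramid(low_res_map, factors=(1, 2, 4)):
    """Build one occlusion map per scale factor (factors are assumed)."""
    return {factor: upsample_nearest(low_res_map, factor) for factor in factors}


# Toy 2x2 occlusion map: 1.0 means visible, 0.0 means fully occluded.
low_res = [[1.0, 0.2],
           [0.0, 0.8]]
pyramid = occlusion_pyramid(low_res)

for factor, scaled_map in pyramid.items():
    print(factor, len(scaled_map), "x", len(scaled_map[0]))
```

In a setup like this, each map in the pyramid would match the spatial size of one decoder stage, so the generator could weight its features with the occlusion map at that scale.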

Speaker recognition systems perform very well on datasets without noise or mismatch. However, their performance degrades under environmental noise, channel variation, and physical and behavioral changes in the speaker. The type of speaker-related feature plays a crucial role in improving the performance of speaker recognition systems. Gammatone Frequency Cepstral Coefficient (GFCC) features have been widely used to develop robust speaker recognition systems with conventional machine learning, where they achieved better performance than Mel Frequency Cepstral Coefficient (MFCC) features in noisy conditions. Recently, deep learning models have shown better performance in speaker recognition than conventional machine learning. Most previous deep learning-based speaker recognition models have used the Mel Spectrogram and similar inputs rather than handcrafted features such as MFCC and GFCC. However, the performance of Mel Spectrogram features degrades at high noise levels and when utterances are mismatched. Similar to the Mel Spectrogram, the Cochleogram is another important feature for deep learning speaker recognition models. Like GFCC features, the Cochleogram represents utterances on the Equivalent Rectangular Bandwidth (ERB) scale, which is important in noisy conditions. However, no study has analyzed the noise robustness of the Cochleogram and Mel Spectrogram in speaker recognition. In addition, only a few studies have used the Cochleogram to develop speech-based deep learning models for noisy and mismatched conditions. This study analyzes the noise robustness of Cochleogram and Mel Spectrogram features in deep learning-based speaker recognition at Signal to Noise Ratio (SNR) levels from −5 dB to 20 dB. Experiments are conducted on the VoxCeleb1 dataset and a noise-added VoxCeleb1 dataset using basic 2DCNN, ResNet-50, VGG-16, ECAPA-TDNN and TitaNet model architectures. The speaker identification and verification performance of both the Cochleogram and the Mel Spectrogram is evaluated. The results show that the Cochleogram performs better than the Mel Spectrogram in both speaker identification and verification under noisy and mismatched conditions.
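To make the SNR range concrete, the following is a minimal sketch of mixing noise into a clean signal at a target SNR using the standard definition SNR(dB) = 10 log10(P_signal / P_noise). The abstract does not describe how the noise-added VoxCeleb1 data was produced, so the synthetic sine tone, the Gaussian noise, the 5 dB step and the helper names (`mix_at_snr`, `measured_snr_db`) are all assumptions for illustration.

```python
import math
import random


def signal_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    noise_power = signal_power(noise)
    if noise_power == 0:
        raise ValueError("noise must not be silent")
    # Required noise power follows from SNR(dB) = 10 * log10(P_clean / P_noise).
    target_noise_power = signal_power(clean) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [c + gain * n for c, n in zip(clean, noise)]


def measured_snr_db(clean, noisy):
    """Recover the SNR of a mixture when the clean signal is known."""
    residual = [y - c for c, y in zip(clean, noisy)]
    return 10 * math.log10(signal_power(clean) / signal_power(residual))


# Stand-in data (assumed): one second of a 440 Hz tone at 16 kHz plus Gaussian noise.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

# Sweep the range named in the abstract; the 5 dB step is an assumption.
for snr_db in range(-5, 25, 5):
    noisy = mix_at_snr(clean, noise, snr_db)
    print(snr_db, round(measured_snr_db(clean, noisy), 2))
```

At −5 dB the noise carries roughly three times the power of the speech, while at 20 dB it carries one hundredth of it, which is why the low end of the range is the harder test for a feature representation.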


Articles published on VoxCeleb1 Dataset

Enhancing Speaker Recognition Models with Noise-Resilient Feature Optimization Strategies

Audio-Visual Fusion Based on Interactive Attention for Person Verification

Few-shot short utterance speaker verification using meta-learning

Improved First-Order Motion Model of Image Animation with Enhanced Dense Motion and Repair Ability

Analyzing Noise Robustness of Cochleogram and Mel Spectrogram Features in Deep Learning Based Speaker Recognition

Few-shot re-identification of the speaker by social robots

Attention based gender and nationality information exploration for speaker identification

VoxCeleb1: Speaker Age-Group Classification using Probabilistic Neural Network

Gender and Age Estimation Methods Based on Speech Using Deep Neural Networks

Age Estimation in Short Speech Utterances Based on Bidirectional Gated-Recurrent Neural Networks

Audio-Visual Deep Neural Network for Robust Person Verification

Federated Learning for Privacy-Preserving Speaker Recognition

Weighted Cluster-Range Loss and Criticality-Enhancement Loss for Speaker Recognition

SAMAF

Disentangled Speaker and Nuisance Attribute Embedding for Robust Speaker Verification

Speaker recognition using PCA-based feature transformation

