Representation Of Speech Research Articles

High-dimensional data such as natural images or speech signals exhibit some form of regularity, preventing their dimensions from varying independently. This suggests that there exists a lower-dimensional latent representation from which the high-dimensional observed data were generated. Uncovering the hidden explanatory features of complex data is the goal of representation learning, and deep latent variable generative models have emerged as promising unsupervised approaches. In particular, the variational autoencoder (VAE), which is equipped with both a generative and an inference model, allows for the analysis, transformation, and generation of various types of data. Over the past few years, the VAE has been extended to deal with data that are either multimodal or dynamical (i.e., sequential). In this paper, we present a multimodal and dynamical VAE (MDVAE) applied to unsupervised audiovisual speech representation learning. The latent space is structured to dissociate the latent dynamical factors that are shared between the modalities from those that are specific to each modality. A static latent variable is also introduced to encode the information that is constant over time within an audiovisual speech sequence. The model is trained in an unsupervised manner on an audiovisual emotional speech dataset, in two stages. In the first stage, a vector quantized VAE (VQ-VAE) is learned independently for each modality, without temporal modeling. The second stage consists of learning the MDVAE model on the intermediate representation of the VQ-VAEs before quantization. The disentanglement between static versus dynamical and modality-specific versus modality-common information occurs during this second training stage. Extensive experiments are conducted to investigate how audiovisual speech latent factors are encoded in the latent space of MDVAE. These experiments include manipulating audiovisual speech, audiovisual facial image denoising, and audiovisual speech emotion recognition. The results show that MDVAE effectively combines the audio and visual information in its latent space. They also show that the learned static representation of audiovisual speech can be used for emotion recognition with few labeled data, and with better accuracy than unimodal baselines and a state-of-the-art supervised model based on an audiovisual transformer architecture.
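As a rough illustration of the quantization step that the abstract refers to, the sketch below shows the standard nearest-codebook lookup used in VQ-VAEs in general. It is not the authors' implementation: the function name `quantize`, the array shapes, and the random data are assumptions chosen for the example. The continuous input `z_e` plays the role of the "intermediate representation before quantization" on which the second training stage is said to operate.

```python
import numpy as np


def quantize(z_e, codebook):
    """Map each encoder vector to its nearest codebook entry (squared Euclidean distance).

    z_e:      (T, D) array of continuous encoder outputs (hypothetical shapes).
    codebook: (K, D) array of codebook vectors.
    Returns the quantized vectors (T, D) and the chosen indices (T,).
    """
    # Pairwise squared distances between every frame and every code: shape (T, K).
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return codebook[indices], indices


# Toy data, purely illustrative.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))  # K = 8 codes of dimension D = 4
z_e = rng.normal(size=(5, 4))       # T = 5 frames of pre-quantization features
z_q, idx = quantize(z_e, codebook)
```

In this sketch, `z_e` is the continuous representation and `z_q` its discretized counterpart; a temporal model trained on `z_e` would see the features before any information is lost to the codebook lookup.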

The object of the study is dialogue texts with signs of corrupt speech behavior. The subject of the study is marker words and linguistic methods that reveal linguistic signs of a discourse semantically focused on receiving or transferring valuables in exchange for actions (or inaction) of the recipient in favor of the giver. The purpose of this work was to conduct a communicative analysis of corrupt speech behavior in the text-dialogue, namely accepting illegal remuneration for actions (or inaction) in an official capacity in order to obtain valuables under an agreement between the specified persons. The main task of the linguistic and forensic examination of anti-corruption cases is to establish the event by its speech representation: it is required to prove that the dialogue is about the transfer of funds. To solve this issue, it is necessary to conduct a communicative analysis of the text-dialogue. The research methodology relies on the communicative analysis of text-dialogues with signs of corruption content, based on the works of D. L. Karpov and on the theoretical and practical works of A. N. Baranov, K. I. Brinev, E. N. Galyashina, M. A. Grachev, I. A. Sternin, and others devoted to cases related to corruption. The scientific novelty of the work lies in the use of a communicative approach within the framework of text-dialogue analysis, together with logical-semantic and functional-stylistic analysis. An analysis of the expert study of text-dialogues in anti-corruption cases is presented, and the conceptual framework is developed. The study concludes that the main task of linguistic expertise in anti-corruption cases is to establish an event by its speech representation through a set of semantic and pragmatic techniques of discourse analysis, which makes it possible to clarify the signs of conversational dialogue that are significant for the legal qualification of speech crimes. When conducting a communicative analysis of a corrupt text, it is important to identify signs of corrupt speech behavior. When analyzing a communicative corruption situation, it is necessary to consider the speech strategies and tactics of the participants in the dialogue, which ultimately make it possible to determine the true intentions of the communicants in pursuing their communicative goal.

Articles published on Representation Of Speech

Understanding Self-Supervised Learning of Speech Representation via Invariance and Redundancy Reduction

LeBenchmark 2.0: A standardized, replicable and enhanced framework for self-supervised representations of French speech

Preserved Gray Matter Volume in the Left Superior Temporal Gyrus Underpins Speech-in-Noise Processing in Middle-Aged Adults.

Speaker voice normalization for end-to-end speech translation

The role of vowel and consonant onsets in neural tracking of natural speech

A multimodal dynamical variational autoencoder for audiovisual speech representation learning

Features of linguistic expertise of texts of dialogues on anti-corruption cases

Sensitive Quantification of Cerebellar Speech Abnormalities Using Deep Learning Models.

Varieties of ironic objection in popular science Internet media text

ZMM-TTS: Zero-Shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-Supervised Discrete Speech Representations

Research on the Innovation of University English Teaching Mode Driven by Artificial Intelligence

QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning

Adapting Pre-Trained Self-Supervised Learning Model for Speech Recognition with Light-Weight Adapters

VatLM: Visual-Audio-Text Pre-Training With Unified Masked Prediction for Speech Representation Learning

Gender Role and Hate Speech Representation in Web Series

Dialect in the Making

Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision

Improving speech emotion recognition by fusing self-supervised learning and spectral features via mixture of experts

Applying the Lombard Effect to Speech-in-Noise Communication

Interpreting Convolutional Layers in DNN Model Based on Time–Frequency Representation of Emotional Speech
