Semi-Supervised Training of Transformer and Causal Dilated Convolution Network with Applications to Speech Topic Classification

Jinxiang Zeng, Zhiyi Li, Xiaolin Li, Du Zhang

doi:10.3390/app11125712


Open Access

https://doi.org/10.3390/app11125712


Abstract

To address the audio event recognition problem in speech recognition, we propose a decision fusion method based on the Transformer and Causal Dilated Convolutional Network (TCDCN) framework. The method can model sound events over long time spans, capture temporal correlations, and deal effectively with the sparsity of audio data. Our dataset consists of audio clips cropped from YouTube. To identify audio topics reliably and stably, we compare different extracted features and different loss functions to find the best model configuration. Experimental results across the test models show that the proposed TCDCN model achieves better recognition results than classification using neural networks and other fusion methods.

Highlights

With the development of Internet communication technology, the channels through which people can share and receive information have been greatly enriched
Because short videos suffer from loose content, high noise, and heavy redundancy, we extract Mel-frequency cepstral coefficient (MFCC) speech features after preprocessing such as silence removal and noise processing, then introduce the Transformer attention mechanism to select the data most relevant to the target and transform the speech into a higher-quality feature subset for the downstream model
The deep neural network model is built on the Keras deep learning framework with TensorFlow as the backend, and the Python programming language is used to complete the entire experiment
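
As a framework-independent illustration of the causal dilated convolution named in the title, the following minimal Python sketch computes one layer on a 1-D signal. It is not the authors' Keras implementation; the function names `causal_dilated_conv1d` and `receptive_field`, the toy signal, and the weights are all assumptions made for illustration. "Causal" here means that output step t depends only on input steps at or before t, and "dilated" means the kernel taps are spaced `dilation` steps apart.

```python
def causal_dilated_conv1d(x, weights, dilation=1):
    """Compute y[t] = sum_k weights[k] * x[t - k*dilation].

    Inputs before the start of the signal are treated as zero (left padding),
    so no output ever looks at a future sample.
    """
    y = []
    for t in range(len(x)):
        total = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # skip taps that fall before the signal starts
                total += w * x[idx]
        y.append(total)
    return y


def receptive_field(kernel_size, dilations):
    """Number of input steps seen by one output after stacking the layers."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Hypothetical toy input: with weights [1, 1] and dilation 2, y[t] = x[t] + x[t-2].
signal = [1.0, 2.0, 3.0, 4.0, 5.0]
out = causal_dilated_conv1d(signal, weights=[1.0, 1.0], dilation=2)
print(out)  # [1.0, 2.0, 4.0, 6.0, 8.0]
print(receptive_field(kernel_size=2, dilations=[1, 2, 4, 8]))  # 16
```

Doubling the dilation at each stacked layer grows the receptive field exponentially with depth, which is one common way such networks cover long audio contexts with few layers.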

Summary

Introduction

With the development of Internet communication technology, the channels through which people can share and receive information have been greatly enriched. According to the SIR information dissemination model and Schramm's Model, the information sender and the information receiver must share a common field of experience for information to be transmitted. In this experiment, a large amount of unlabeled data is used together with labeled data for pattern recognition, and semi-supervised learning (SSL) is used to improve the accuracy and speed of learning.
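
The general idea of combining labeled and unlabeled data can be sketched with self-training (pseudo-labeling), one common SSL strategy. This section does not state which SSL scheme the authors use, so the choice of self-training, the nearest-centroid classifier standing in for the neural network, and the names `centroids`, `predict`, `self_train`, and `min_margin` are all assumptions made for illustration.

```python
from math import dist


def centroids(points, labels):
    """Mean feature vector of each class."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(cents, p):
    """Return (label, margin): nearest centroid and its lead over the runner-up."""
    ranked = sorted((dist(p, c), y) for y, c in cents.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else float("inf")
    return ranked[0][1], margin


def self_train(labeled, labels, unlabeled, min_margin=1.0, rounds=5):
    """Repeatedly pseudo-label confident unlabeled points and refit."""
    points, ys, pool = list(labeled), list(labels), list(unlabeled)
    for _ in range(rounds):
        cents = centroids(points, ys)
        still_unlabeled, added = [], False
        for p in pool:
            y, margin = predict(cents, p)
            if margin >= min_margin:  # confident: accept the pseudo-label
                points.append(p)
                ys.append(y)
                added = True
            else:  # ambiguous: keep for a later round
                still_unlabeled.append(p)
        pool = still_unlabeled
        if not added:
            break
    return centroids(points, ys), pool


# Hypothetical 2-D features: two labeled points and three unlabeled ones.
model, leftover = self_train(
    labeled=[(0.0, 0.0), (10.0, 10.0)],
    labels=["a", "b"],
    unlabeled=[(1.0, 1.0), (9.0, 9.0), (5.0, 5.0)],
)
print(model["a"], leftover)
```

In this toy run the two unlabeled points near a class are absorbed into it, while the point equidistant from both classes stays unlabeled because its margin falls below `min_margin`.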

Objectives

Methods

Results