Abstract

A cross-modal speech-text retrieval method using an interactive-learning convolutional autoencoder (CAE) is proposed. First, an interactive-learning autoencoder structure is proposed, with two inputs (speech and text) and processing stages of encoding, hidden-layer interaction, and decoding, to model cross-modal speech-text retrieval. Then, the original audio signal is preprocessed and Mel-frequency cepstral coefficient (MFCC) features are extracted. In addition, a bag-of-words model is used to extract text features, and an attention mechanism is then used to combine the text and speech features. Through the interactive-learning CAE, features shared between the speech and text modalities are obtained and then fed into a modal classifier to identify modal information, thereby realizing cross-modal speech-text retrieval. Finally, experiments show that the proposed algorithm outperforms the comparison algorithms in terms of recall, precision, and false recognition rate.
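As a rough illustration of the feature-extraction steps described above, the sketch below computes MFCC speech features and bag-of-words text features. It assumes the librosa and scikit-learn libraries, a synthetic test signal, and illustrative parameter settings that are not taken from the paper.

```python
# Minimal sketch (not the paper's exact code): MFCC speech features and
# bag-of-words text features. Window and coefficient settings are illustrative.
import numpy as np
import librosa
from sklearn.feature_extraction.text import CountVectorizer

# Synthetic 1-second audio signal standing in for a preprocessed utterance.
sr = 16000
speech = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)

# MFCC features: an (n_mfcc, n_frames) matrix of cepstral coefficients.
mfcc = librosa.feature.mfcc(y=speech, sr=sr, n_mfcc=13)

# Bag-of-words text features: term-count vectors, one per text query.
texts = ["open the door", "turn on the light"]
bow = CountVectorizer().fit_transform(texts).toarray()

print(mfcc.shape, bow.shape)
```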

Highlights

  • With the development of communication and Internet technologies, a large amount of multimedia data has been produced

  • For voice, information retrieval based on the content of the speech itself is still at the research stage [3, 4]

  • Keyword retrieval based on voice is a method to realize human-computer command interaction

Summary

Introduction

With the development of communication and Internet technologies, a large amount of multimedia data has been produced. Keyword retrieval based on voice is a method to realize human-computer command interaction. In [27], during the construction of the mapping mechanism for multimodal information retrieval, the deep learning method avoids feature extraction on single-modal data and greatly speeds up model construction. In [29], the authors proposed an architecture called DenseNet-BiLSTM for keyword retrieval. In their task, the keywords allowed as input are selected from a fixed set of command words, whereas the application scenario of our model allows the input of any word or phrase of the language. Most of these methods assume that the semantic data of different modalities carry the same amount of information. The generative model embeds semantic label information and promotes the difference between modality-specific features and shared features, so as to generate discriminative modality-shared and modality-specific representations. The discriminant model feeds the learned modality-shared representations of the speech and text modalities into a modal classifier to identify modal information and realize cross-modal speech-text retrieval.
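To make the encoder, hidden-layer interaction, decoder, and modal-classifier structure concrete, the following PyTorch sketch assembles a simplified interactive-learning CAE with a speech branch, a text branch, a shared hidden code, reconstruction decoders, and a modal classifier. The layer sizes and the simple averaging used as the hidden-layer interaction are assumptions for illustration, not the paper's exact architecture.

```python
# Illustrative sketch of an interactive-learning convolutional autoencoder
# with a modal classifier; dimensions and the interaction rule are assumed.
import torch
import torch.nn as nn

class InteractiveCAE(nn.Module):
    def __init__(self, mfcc_dim=13, frames=32, vocab=1000, hidden=128):
        super().__init__()
        # Convolutional encoder over MFCC frames (speech branch).
        self.speech_enc = nn.Sequential(
            nn.Conv1d(mfcc_dim, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(32 * frames, hidden))
        # Fully connected encoder over bag-of-words vectors (text branch).
        self.text_enc = nn.Sequential(nn.Linear(vocab, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden))
        # Decoders reconstruct each modality from the shared code.
        self.speech_dec = nn.Linear(hidden, mfcc_dim * frames)
        self.text_dec = nn.Linear(hidden, vocab)
        # Modal classifier: predicts which modality a hidden code came from.
        self.modal_clf = nn.Linear(hidden, 2)

    def forward(self, mfcc, bow):
        hs, ht = self.speech_enc(mfcc), self.text_enc(bow)
        shared = 0.5 * (hs + ht)  # hidden-layer interaction (simplified)
        return (self.speech_dec(shared), self.text_dec(shared),
                self.modal_clf(hs), self.modal_clf(ht))

model = InteractiveCAE()
mfcc = torch.randn(4, 13, 32)   # batch of MFCC feature maps
bow = torch.randn(4, 1000)      # batch of bag-of-words vectors
rec_speech, rec_text, logit_s, logit_t = model(mfcc, bow)
print(rec_speech.shape, rec_text.shape, logit_s.shape)
```

In training, reconstruction losses on both decoders and a cross-entropy loss on the modal classifier would be combined so that the shared code both preserves content and confuses the classifier about its source modality.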

Feature Extraction
Voice Keyword Search Implementation
Voice Keyword Retrieval Realization Experiment Scheme
[Experiment hardware table: GPU with 48 GB GDDR6 memory]
