Deep learning: from speech recognition to language and multimodal processing

Li Deng

doi:10.1017/atsip.2015.22

Abstract

While artificial neural networks have been in existence for over half a century, it was not until 2010 that they made a significant impact on speech recognition, with a deep form of such networks. This invited paper, based on my keynote talk given at the Interspeech conference in Singapore in September 2014, first reflects on the historical path to this transformative success, after providing brief reviews of earlier studies on (shallow) neural networks and on (deep) generative models relevant to the introduction of deep neural networks (DNNs) to speech recognition several years ago. The role of well-timed academic-industrial collaboration is highlighted, as are the advances of big data, big compute, and the seamless integration between the application-domain knowledge of speech and the general principles of deep learning. Then, an overview is given of the sweeping achievements of deep learning in speech recognition since its initial success. Such achievements, summarized into six major areas in this article, have resulted in across-the-board, industry-wide deployment of deep learning in speech recognition systems. Next, more challenging applications of deep learning, namely natural language and multimodal processing, are selectively reviewed and analyzed. Examples include machine translation, knowledge base completion, information retrieval, and automatic image captioning, where fresh ideas from deep learning, continuous-space embedding in particular, are shown to be revolutionizing these application areas, albeit at a less rapid pace than for speech and image recognition. Finally, a number of key issues in deep learning are discussed, and future directions are analyzed for perceptual tasks such as speech, image, and video, as well as for cognitive tasks involving natural language.

Highlights

The main theme of this paper is to reflect on the recent history of how deep learning has profoundly revolutionized the field of automatic speech recognition (ASR) and to elaborate on what lessons we can learn to further advance ASR technology and to impact the related, arguably more important, applications in language and multimodal processing.
The review analyzes the roles of generative models and points out that the key advantages of embedding knowledge about speech dynamics, which deep generative modeling naturally enables, have yet to be incorporated into the new-generation deep learning framework.
One remaining challenge lies in how to effectively integrate major relevant speech knowledge and problem constraints into the deep models of the future. Examples of such knowledge and constraints include distributed, feature-based phonological representations of the sound patterns of language via hierarchical structure based on modern phonology, articulatory dynamics, and motor program control, acoustic distortion mechanisms for the generation of noisy, reverberant speech in multi-speaker environments, Lombard effects caused by modification of articulatory behavior due to noise-induced reduction of communication effectiveness, and so on.

Summary

INTRODUCTION

The main theme of this paper is to reflect on the recent history of how deep learning has profoundly revolutionized the field of automatic speech recognition (ASR) and to elaborate on what lessons we can learn to further advance ASR technology and to impact the related, arguably more important, applications in language and multimodal processing. Semantic analysis of language and multimodal processing involving speech, text, and image, both of which have advanced rapidly on the basis of deep learning over the past few years, hold the potential to solve some of the remaining difficult ASR problems and to present new challenges for deep learning technology.

SOME BRIEF HISTORY OF “DEEP” SPEECH RECOGNITION

ACHIEVEMENTS OF DEEP LEARNING IN SPEECH RECOGNITION

DEEP LEARNING FOR NATURAL LANGUAGE AND MULTIMODAL PROCESSING

Findings

FUTURE WORK


Journal: APSIPA Transactions on Signal and Information Processing	Publication Date: Jan 1, 2016
Citations: 53	License type: CC BY 4.0


