Abstract
Similar to humans' cognitive ability to generalize knowledge and skills, self-supervised learning (SSL) aims to discover general representations from large-scale data. Through the use of pre-trained SSL models for downstream tasks, this alleviates the need for human annotation, which is expensive and time-consuming. Its success in the fields of computer vision and natural language processing has prompted its recent adoption in audio and speech processing. Comprehensive reviews summarizing the knowledge in audio SSL are currently missing. To fill this gap, we provide an overview of the SSL methods used for audio and speech processing applications. We also summarize the empirical works that exploit the audio modality in multi-modal SSL frameworks, as well as the existing benchmarks suited to evaluating the power of SSL in the computer audition domain. Finally, we discuss some open problems and point out future directions in the development of audio SSL.
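The following is a minimal, illustrative sketch (not taken from the survey) of the downstream use the abstract describes: a publicly released self-supervised audio model is loaded as a frozen feature extractor, and its label-free representations can then feed a lightweight classifier for a supervised task. The wav2vec 2.0 checkpoint and the Hugging Face `transformers` API are assumptions made for illustration.

```python
# Sketch: extracting features from a pre-trained audio SSL model for a
# downstream task. Model choice and library are illustrative assumptions.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Load a publicly released self-supervised checkpoint (assumed name).
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

# One second of dummy 16 kHz audio standing in for a real recording.
waveform = torch.randn(16000).numpy()

inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    # Frame-level representations learned without human labels.
    hidden_states = model(**inputs).last_hidden_state  # (1, frames, 768)

# Mean-pool over time to obtain a clip-level embedding; a small classifier
# trained on such embeddings is a typical downstream setup.
clip_embedding = hidden_states.mean(dim=1)
print(clip_embedding.shape)  # torch.Size([1, 768])
```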