Speaker Diarization based on Deep Learning Techniques: A Review

Thaer M Al-Hadithy, Mondher Frikha, Zaidoon Kamil Maseer

doi:10.1109/ismsit56059.2022.9932710

Abstract

Speaker diarization is the task of determining “who spoke when” by labeling audio or video recordings with classes that correspond to speaker identity. In its early years, speaker diarization for multi-speaker audio recordings was developed within the speech recognition domain to enable speaker-adaptive processing. Over time, it gained relevance as a standalone technology that supplies speaker-specific meta-information for downstream tasks such as audio retrieval. In recent years, speaker diarization has progressed rapidly thanks to the advent of deep learning, which has led to ground-breaking shifts in research and practice across speech applications. The goal of this work is to provide an in-depth review of the studies published in the last 20 years, for which a total of 85 studies were selected. We highlight recent improvements in speaker diarization, along with open challenges, to help researchers working in this field.

Full Text