Abstract

Deep learning and reinforcement learning are opening up new possibilities for automatically matching video and audio data. This article explores the key steps in developing such a system, from aligning phonemes with lip movements to selecting appropriate machine-learning models. It discusses the importance of designing the reward function correctly, the balance between exploration and exploitation, and the complexities of collecting training data. It highlights the value of pre-trained models and transfer learning, and of correctly evaluating and interpreting results to improve the system and produce high-quality content. Finally, the article stresses the need for effective mapping-quality metrics and visualization methods to analyze system performance fully and identify areas for improvement.
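The exploration/exploitation balance mentioned above is commonly handled with an epsilon-greedy policy: with probability epsilon the agent tries a random action (exploration), otherwise it picks the action with the highest estimated value (exploitation). The sketch below is a minimal illustration, not the article's implementation; the Q-values and the lip-movement interpretation are hypothetical.

```python
import random

def select_action(q_values, epsilon=0.1):
    """Epsilon-greedy selection: explore with probability epsilon,
    otherwise exploit the action with the highest estimated value."""
    if random.random() < epsilon:
        # Explore: pick a random action index
        return random.randrange(len(q_values))
    # Exploit: pick the index of the best-known action
    return max(range(len(q_values)), key=q_values.__getitem__)

# Toy example: estimated values for three candidate audio-to-lip mappings
q = [0.2, 0.8, 0.5]
best = select_action(q, epsilon=0.0)  # epsilon=0 always exploits
```

With `epsilon=0.0` the function always returns index 1 here (the highest value); raising epsilon trades off short-term reward for discovering potentially better mappings.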
