Speech extraction with RGB-intensity gradient on rolling-shutter video

Tsubasa Yoshizawa,Takanobu Nishiura,Atsushi Yoshida,Kenta Iwai

doi:10.3397/in-2021-1753

Abstract

Recent studies have been proposed to extract speech from the captured video of objects vibrating by sound waves. Among them, from the viewpoint of equipment cost, the method of extracting speech from the video captured by rolling-shutter cameras, which are widely used in consumer digital single-lens reflex cameras, has been attracting attention. The conventional method with the rolling-shutter video uses a grayscale video for processing based on phase images. However, a grayscale video has a smaller dynamic range than an RGB video, and thus the speech extraction accuracy of the conventional method degrades. Therefore, this paper proposes a speech extraction method based on RGB-intensity gradients on an RGB video to improve speech extraction accuracy. The proposed method extracts the speech by calculating the similarity of R, G, and B intensity gradients, and using these three intensity gradients expands the dynamic range. The experimental results on the quality and intelligibility of the extracted speech show our proposed method outperforms the conventional method.

Full Text