Recent studies have proposed methods for extracting speech signals from videos of objects vibrating under sound waves. Among them, methods that use a rolling-shutter camera, a widely used camera type, have been attracting attention. A rolling-shutter camera records image data one row of pixels at a time, and can therefore capture the vibration of objects caused by sound waves. However, there are time intervals between the frames of a video, which leave missing segments in the extracted speech signals. The conventional method interpolates these missing segments with an autoregressive model, but it ignores the noise in the extracted speech signals, so the noise remains. In this paper, we propose a method that uses dual rolling-shutter cameras and interpolates the missing segments based on singular spectrum analysis, which takes the noise into account to further improve speech quality. With singular spectrum analysis, the missing segments can be estimated using only the speech components of the signals, which correspond to large singular values, thereby reducing the noise. Experimental results show that the proposed method outperforms the conventional methods in terms of the quality and intelligibility of the extracted speech signals.
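
To illustrate the idea of filling missing segments from the components with large singular values, the following is a minimal sketch of generic iterative singular spectrum analysis gap filling in Python with NumPy. It is not the algorithm of this paper: it uses a single signal and does not model the dual-camera setup, and the function names `ssa_reconstruct` and `ssa_fill_gaps` as well as the `window`, `rank`, and `n_iter` parameters are assumptions made for illustration.

```python
import numpy as np


def ssa_reconstruct(x, window, rank):
    """Rank-limited SSA reconstruction of a 1-D signal (illustrative helper)."""
    n = len(x)
    k = n - window + 1
    # Trajectory (Hankel) matrix: column j holds x[j : j + window].
    idx = np.arange(window)[:, None] + np.arange(k)[None, :]
    traj = x[idx]
    # Keep only the components with the largest singular values.
    u, s, vt = np.linalg.svd(traj, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
    # Diagonal averaging: average all matrix entries that map to one sample.
    total = np.zeros(n)
    count = np.zeros(n)
    np.add.at(total, idx, low_rank)
    np.add.at(count, idx, 1.0)
    return total / count


def ssa_fill_gaps(x, missing, window, rank, n_iter=50):
    """Iteratively fill the samples flagged in the boolean mask `missing`.

    Returns (filled, denoised): `filled` keeps the observed samples and
    replaces only the gaps; `denoised` is the final low-rank reconstruction.
    """
    # Start from zeros in the gaps (values of x at missing positions are ignored).
    filled = np.where(missing, 0.0, x).astype(float)
    for _ in range(n_iter):
        recon = ssa_reconstruct(filled, window, rank)
        # Update only the gaps; observed samples stay untouched.
        filled[missing] = recon[missing]
    return filled, ssa_reconstruct(filled, window, rank)


if __name__ == "__main__":
    # Synthetic stand-in for an extracted signal: a noisy tone with periodic
    # gaps that mimic the dead time between video frames (hypothetical values).
    t = np.arange(400)
    clean = np.sin(2 * np.pi * t / 25.0)
    noisy = clean + 0.1 * np.random.default_rng(0).standard_normal(t.size)
    missing = (t % 50 >= 20) & (t % 50 < 25)
    observed = np.where(missing, np.nan, noisy)

    filled, denoised = ssa_fill_gaps(observed, missing, window=40, rank=2)
    gap_rms = np.sqrt(np.mean((filled[missing] - clean[missing]) ** 2))
    print("RMS error inside the gaps:", gap_rms)
```

In this sketch `rank` sets how many singular components are treated as signal; a single sinusoid needs two, whereas real speech would need more, and choosing that number is where the trade-off between preserving speech and rejecting noise lies.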