Recent studies have proposed extracting speech from video of objects that vibrate in response to sound waves. Among these approaches, methods that use video captured with a rolling shutter, which is widely used in consumer digital
single-lens reflex cameras, have attracted attention because of their low equipment cost. The conventional rolling-shutter method processes a grayscale video based on phase images. However, a grayscale video has a smaller dynamic range than an RGB video, and thus the speech extraction accuracy of
the conventional method is limited. To improve accuracy, this paper proposes a speech extraction method based on RGB-intensity gradients computed from an RGB video. The proposed method extracts speech by calculating the similarity of the R, G, and B intensity gradients;
using all three gradients expands the dynamic range. Experimental results on the quality and intelligibility of the extracted speech show that the proposed method outperforms the conventional method.
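
As a rough illustration of the idea of per-channel intensity gradients, the following minimal sketch treats each image row of a rolling-shutter frame as one time sample and differences the mean R, G, and B intensities between consecutive rows. It is not the proposed algorithm: the similarity measure is not specified in this abstract, so the sketch substitutes a plain average of the three channel gradients, and the names `row_means` and `rgb_row_gradient_signal` are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B) intensities of one pixel
Frame = List[List[Pixel]]      # frame[row][col]


def row_means(frame: Frame) -> List[Tuple[float, ...]]:
    """Return the mean R, G, and B intensity of every image row."""
    means = []
    for row in frame:
        n = len(row)
        means.append(tuple(sum(px[c] for px in row) / n for c in range(3)))
    return means


def rgb_row_gradient_signal(frame: Frame) -> List[float]:
    """Build a 1-D signal from row-to-row RGB intensity gradients.

    Assumption: with a rolling shutter, rows are exposed one after another,
    so the row index is treated here as a time axis.
    Assumption: the three channel gradients are combined by a plain average,
    standing in for the similarity computation, which is not given here.
    """
    means = row_means(frame)
    signal = []
    for prev, cur in zip(means, means[1:]):
        grads = [cur[c] - prev[c] for c in range(3)]  # one gradient per channel
        signal.append(sum(grads) / 3.0)
    return signal
```

For example, a three-row frame whose first row is black and whose remaining rows are uniformly `(3, 6, 9)` yields one nonzero gradient sample followed by a zero sample. Keeping the channels separate until the final combination step is what lets each channel contribute its own intensity range, in contrast to collapsing to grayscale first.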