Given an observed low frame rate video, video frame interpolation (VFI) aims to generate a high frame rate video, which has smooth video frames with higher frames per second (FPS). Most existing VFI methods often focus on generating one frame at a specific timestep, e.g., 0.5, between every two frames, thus lacking the flexibility to increase the video’s FPS by an arbitrary scale, e.g., 3. To better address this issue, in this paper, we propose an arbitrary timestep video frame interpolation (ATVFI) network with time-dependent decoding. Generally, the proposed ATVFI is an encoder–decoder architecture, where the interpolation timestep is an extra input added to the decoder network; this enables ATVFI to interpolate frames at arbitrary timesteps between input frames and to increase the video’s FPS at any given scale. Moreover, we propose a data augmentation method, i.e., multi-width window sampling, where video frames can be split into training samples with multiple window widths, to better leverage training frames for arbitrary timestep interpolation. Extensive experiments were conducted to demonstrate the superiority of our model over existing baseline models on several testing datasets. Specifically, our model trained on the GoPro training set achieved 32.50 on the PSNR metric on the commonly used Vimeo90k testing set.