Automatic video generation is a challenging research topic, attracting interest from different perspectives, including Image-to-Video generation (I2V), Video-to-Video generation (V2V), and Text-to-Video generation (T2V). To pursue more controllable and fine-grained video generation, a novel video generation task, named Text-Image-to-Video generation (TI2V), and a corresponding baseline solution, named Motion Anchor-based video Generator (MAGE), were proposed. However, two other factors, namely clean datasets and reliable evaluation metrics, also play important roles in the success of the TI2V task. In this paper, we present a complete benchmark for the TI2V task, which includes synthetic video-text paired datasets, a baseline method, and two evaluation metrics. More specifically: (1) Two versions of synthetic datasets are built based on CATER, containing rich combinations of objects and actions, as well as the resulting changes in brightness and shadow. We also provide both explicit and ambiguous text descriptions to support deterministic and diverse video generation, respectively. (2) A refined version of MAGE, dubbed MAGE+, is proposed with an innovative motion anchor structure that stores appearance-motion aligned representations, which can be further injected with explicit conditions and implicit randomness to model the uncertainty in the data distribution. (3) To evaluate the quality of generated videos, especially given ambiguous descriptions, we introduce action precision and referring expression precision to assess the quality of motion based on a captioning-and-matching method. Experiments conducted on the proposed datasets, as well as other relevant datasets, verify the effectiveness of our baseline and show the appealing potential of the TI2V task. Code and data are available at https://github.com/Youncy-Hu/MAGE.