This study investigated how video caption type affected the vocabulary learning and listening comprehension of low-intermediate Chinese-speaking learners of English. Each video was presented twice with one of five caption types: (1) no caption (NC), (2) full caption with no audio (FCNA), (3) full caption (FC), (4) full caption with highlighted target words (FCHTW), and (5) full caption with highlighted target words and L1 gloss (FCL1), in which the gloss was presented simultaneously with the full caption. The results showed that caption type did affect vocabulary learning. FCL1 facilitated the learning of both word form and word meaning in a multimedia listening activity. FCHTW increased attention to word form at the expense of word meaning. Videos with captions only (FCNA) or audio only (NC) were not helpful for the learning of written words, indicating that presenting verbal information through two modalities (audio plus text) was superior to single-modality presentation. Although caption type had no impact on listening comprehension, the concurrent presentation of video, audio, and captions did not overload the learners in the FC condition, suggesting that selective attention might be allocated to different parts of the visual stimuli during the first and second exposures to the videos. Additionally, the presence of highlighted words and glosses in the captioning line might direct learner attention to vocabulary rather than video content.