The effect of video captions on vocabulary acquisition is a growing topic of interest among second language (L2) researchers. Expanding on this line of inquiry, this study explores how keyword captions enhanced with first language (L1) and L2 glosses (i.e., definitions or brief explanations of unfamiliar words) affect L2 learners’ vocabulary acquisition. The study involved 101 Korean undergraduate students randomly assigned to a baseline (keyword caption only), L1 gloss (keyword caption + L1 gloss), or L2 gloss (keyword caption + L2 gloss) group. The participants viewed the video corresponding to their group while their eye movements were tracked and recorded. Subsequently, they completed two vocabulary tests assessing word form and meaning recall. Linear mixed-effects analysis revealed that keyword captions with L1 and L2 glosses resulted in distinct attentional allocation and contributed differently to lexical acquisition. Notably, although the three groups spent similar amounts of time reading the target word forms, the L1 gloss group significantly outperformed the other two groups in word form recall scores. Furthermore, although the L2 gloss group spent three times as long reading L2 glosses as the L1 gloss group spent reading L1 glosses, the latter had significantly higher scores in the form translation test. These findings have significant implications for L2 pedagogy, particularly regarding the choice between target-language-exclusive and bilingual approaches.