The growing availability of virtual reality (VR) tools and the appeal of their interactivity for users have driven the spread of this technology in language education. However, studies comparing the effectiveness of VR with traditional methods remain few and report conflicting results, which may be explained by additional factors such as the learner's motor activity and the speech learning strategies used. To clarify the role of these factors, we developed an ecologically natural design of learning tasks, presented either in VR or on a computer monitor, in which new words were presented auditorily together with their visual referents in the context of interrogative sentences. The controlled variables were the speech learning strategy (Fast Mapping vs. Explicit Encoding) and the motor response to the question (high-velocity whole-hand movements vs. low-amplitude finger movements). Sixteen participants each learned eight nouns in the two learning environments. Learning outcomes were assessed with a recognition task; response accuracy was analyzed with repeated-measures ANOVA (RM-ANOVA) and reaction time with the Wilcoxon test. Recognition accuracy for new words did not differ significantly between VR (55%) and the computer monitor (61%). Words learned with high-velocity whole-hand movements were recognized significantly better when learned through Fast Mapping, whereas words learned with low-amplitude finger movements were recognized significantly better with Explicit Encoding on the computer monitor. Words learned through Explicit Encoding with low-amplitude movements were recognized faster in VR than on the computer monitor. This pilot study demonstrated effective semantic acquisition of new words in both learning environments, as well as a combined influence of speech learning strategies and the learner's motor activity on this process.
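
For illustration, the sketch below shows one way the described analysis could be set up in Python: an RM-ANOVA on recognition accuracy with within-subject factors and a Wilcoxon signed-rank test on reaction times. This is not the authors' analysis code; the column names (`subject`, `environment`, `strategy`, `movement`, `accuracy`, `rt`) and the simulated values are assumptions made for the example.

```python
# Hypothetical analysis sketch, assuming a long-format table with one
# mean accuracy and reaction-time value per participant and condition.
import numpy as np
import pandas as pd
from scipy.stats import wilcoxon
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)

# Simulated data: 16 participants x 2 environments x 2 strategies
# x 2 movement types (placeholder values, not the study's results).
subjects = np.arange(1, 17)
cells = pd.MultiIndex.from_product(
    [subjects, ["VR", "monitor"], ["FM", "EE"], ["whole_hand", "finger"]],
    names=["subject", "environment", "strategy", "movement"],
)
data = cells.to_frame(index=False)
data["accuracy"] = rng.uniform(0.4, 0.8, len(data))
data["rt"] = rng.uniform(0.8, 2.0, len(data))

# Repeated-measures ANOVA on recognition accuracy with three
# within-subject factors.
anova = AnovaRM(
    data, depvar="accuracy", subject="subject",
    within=["environment", "strategy", "movement"],
).fit()
print(anova)

# Wilcoxon signed-rank test on reaction times for the comparison
# mentioned in the abstract: Explicit Encoding with low-amplitude
# finger movements, VR vs. computer monitor.
cell = (data["strategy"] == "EE") & (data["movement"] == "finger")
rt_vr = data[cell & (data["environment"] == "VR")].sort_values("subject")["rt"]
rt_mon = data[cell & (data["environment"] == "monitor")].sort_values("subject")["rt"]
print(wilcoxon(rt_vr.to_numpy(), rt_mon.to_numpy()))
```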