Abstract

Japanese pitch accent is phonemic, making it crucial for second-language learners to acquire. Building on theories of multimodal learning, the present study explored how auditory, visual and gestural training of Japanese pitch accent affected behavioral, neural and metacognitive aspects of pitch perception across two experiments. Experiment 1 used a between-subjects pre/posttest design to train native English speakers to perceive Japanese pitch accents in one of three conditions: (1) baseline (audio + flat notation), (2) pitch height notation (audio + notation mimicking pitch height) and (3) pitch height notation + a left-hand gesture (L-gesture), intended to engage the contralateral right hemisphere, which is specialized for suprasegmental pitch processing. The results indicated that pitch height notation training (condition 2) yielded the most robust benefits, as participants in this condition improved on trained and novel words alike. Experiment 2 used a within-subjects design to extend Experiment 1 in three ways: adding a right-hand gesture (R-gesture) condition (to engage more segmental language areas in the left hemisphere), introducing a neural correlate of cognitive load (measured by EEG alpha and theta power) and adding a metacognitive subjective assessment of learning (e.g., ‘Which training did you find the most helpful?’). The results showed that although the four training conditions did not differ in learning outcomes or EEG power, participants gave the most positive subjective evaluations to pitch height notation and R-gesture training. Together, the results suggest that there may be a ‘just right’ amount of multimodal instruction to boost learning and increase engagement during foreign language pitch instruction.