During communication in real‐life settings, the brain integrates information from auditory and visual modalities to form a unified percept of our environment. In the current magnetoencephalography (MEG) study, we used rapid invisible frequency tagging (RIFT) to generate steady‐state evoked fields and investigated the integration of audiovisual information in a semantic context. We presented participants with videos of an actress uttering action verbs (auditory; tagged at 61 Hz) accompanied by a gesture (visual; tagged at 68 Hz, using a projector with a 1,440 Hz refresh rate). Integration difficulty was manipulated by lower‐order auditory factors (clear/degraded speech) and higher‐order visual factors (congruent/incongruent gesture). We identified MEG spectral peaks at the individual (61/68 Hz) tagging frequencies. We furthermore observed a peak at the intermodulation frequency of the auditory and visually tagged signals (fvisual − fauditory = 7 Hz), specifically when lower‐order integration was easiest because signal quality was optimal. This intermodulation peak is a signature of nonlinear audiovisual integration, and was strongest in left inferior frontal gyrus and left temporal regions; areas known to be involved in speech‐gesture integration. The enhanced power at the intermodulation frequency thus reflects the ease of lower‐order audiovisual integration and demonstrates that speech‐gesture information interacts in higher‐order language areas. Furthermore, we provide a proof‐of‐principle of the use of RIFT to study the integration of audiovisual stimuli, in relation to, for instance, semantic context.