Multimodal communication may evolve because different signals may convey information about the signaller (content-based selection), increase the efficacy of signal processing or transmission through the environment (efficacy-based selection), or modify the production of a signal or the receiver's response to it (inter-signal interaction selection). To understand the function of a multimodal signal (aggressive calls + toe flags) emitted by males of the frog Crossodactylus schmidti during territorial contests, we tested two hypotheses related to content-based selection (quality and redundant signal), one related to efficacy-based selection (efficacy backup), and one related to inter-signal interaction selection (context). For each hypothesis, we derived unique predictions based on the biology of the study species. In a natural setting, we exposed resident males to a robot frog simulating aggressive calls (acoustic stimulus) and toe flags (visual stimulus), combined and in isolation, and measured quality-related traits of the males as well as local levels of background noise and light intensity. Our results support the context hypothesis, as toe flags (the context signal) are insufficient to elicit a receiver's response on their own. However, when toe flags are emitted together with aggressive calls, they evoke in the receiver responses that are qualitatively and quantitatively different from those evoked by aggressive calls alone. In contrast, we found no evidence that toe flags and aggressive calls provide complementary or redundant information about male quality, which are key predictions of the quality and redundant signal hypotheses, respectively. Finally, the multimodal signal did not increase the receiver's response across natural gradients of light and background noise, a key prediction of the efficacy backup hypothesis. Toe flags accompanying aggressive calls seem to provide contextual information that modifies the receiver's response in territorial contests.
We suggest that this contextual information is increased motivation to escalate the contest, and we discuss the benefits to signallers and receivers of adding a contextual signal to the aggressive display. Examples of context-dependent multimodal signals are rare in the literature, probably because most studies focus on single hypotheses assuming content- or efficacy-based selection. Our study highlights the importance of considering multiple selective pressures when testing multimodal signal function.