This paper investigates how speech and gestures convey stress, and how they can be used for automatic stress recognition. As a first step, we examine how humans use speech and gestures to convey stress. In particular, for both speech and gestures, we distinguish between stress conveyed by the intended semantic message (e.g. spoken words for speech, symbolic meaning for gestures) and stress conveyed by the modulation of either modality (e.g. intonation for speech, speed and rhythm for gestures). As a second step, we use this decomposition of stress as an approach for automatic stress prediction. The considered components provide an intermediate representation with intrinsic meaning, which helps bridge the semantic gap between the low-level sensor representation and the high-level, context-sensitive interpretation of behavior. Our experiments are run on an audiovisual dataset of service-desk interactions. The final goal is a surveillance system that issues a notification when the stress level is high and extra assistance is needed. We find that speech modulation is the best performing intermediate-level variable for automatic stress prediction. Using gestures increases performance and is most beneficial when speech is lacking. The two-stage approach with intermediate variables performs better than baseline feature-level or decision-level fusion.
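To make the two-stage idea concrete, the sketch below shows one possible realization: a first stage regresses the four intermediate variables (speech semantics, speech modulation, gesture semantics, gesture modulation) from low-level audio and visual features, and a second stage maps their predictions to a stress estimate. The paper does not prescribe particular models or feature splits; the choice of SVR and linear regression, the variable names, and the audio/visual feature assignment here are illustrative assumptions only.

```python
# Minimal sketch of a two-stage stress predictor with intermediate variables.
# Assumptions: SVR for stage 1, linear regression for stage 2, and the mapping
# of audio features to speech variables / visual features to gesture variables.
import numpy as np
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression


class TwoStageStressPredictor:
    """Stage 1: low-level features -> intermediate variables.
    Stage 2: intermediate variables -> stress score."""

    VARIABLES = ["speech_semantics", "speech_modulation",
                 "gesture_semantics", "gesture_modulation"]

    def __init__(self):
        # One regressor per intermediate variable (assumed, not from the paper).
        self.stage1 = {name: SVR() for name in self.VARIABLES}
        self.stage2 = LinearRegression()

    def _features_for(self, name, X_audio, X_visual):
        # Speech variables are predicted from audio features,
        # gesture variables from visual features (assumed split).
        return X_audio if name.startswith("speech") else X_visual

    def fit(self, X_audio, X_visual, y_intermediate, y_stress):
        # y_intermediate: dict mapping each variable name to per-sample annotations.
        for name, model in self.stage1.items():
            model.fit(self._features_for(name, X_audio, X_visual),
                      y_intermediate[name])
        # Stack stage-1 predictions as the intermediate representation.
        Z = np.column_stack([
            model.predict(self._features_for(name, X_audio, X_visual))
            for name, model in self.stage1.items()
        ])
        self.stage2.fit(Z, y_stress)
        return self

    def predict(self, X_audio, X_visual):
        Z = np.column_stack([
            model.predict(self._features_for(name, X_audio, X_visual))
            for name, model in self.stage1.items()
        ])
        return self.stage2.predict(Z)
```

In this sketch, the intermediate representation Z carries the intrinsic meaning described above, in contrast to baseline feature-level fusion (concatenating audio and visual features directly) or decision-level fusion (combining per-modality stress predictions).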