The present study investigated unimodal and multimodal emotion perception by humans, with an eye toward applying the findings to automated affect detection. The focus was on assessing the reliability with which untrained human observers could detect naturalistic expressions of non-basic affective states (boredom, engagement/flow, confusion, frustration, and neutral) from previously recorded videos of learners interacting with a computer tutor. The experiment manipulated three modalities to produce seven conditions: face, speech, context, face+speech, face+context, speech+context, and face+speech+context. Agreement between two observers (OO) and between an observer and a learner (LO) was computed and analyzed with mixed-effects logistic regression models. The results indicated that agreement was generally low (kappas ranged from .030 to .183) but, with one exception, was greater than chance. Comparisons of overall agreement (across affective states) between the unimodal and multimodal conditions supported redundancy effects between modalities, but there were superadditive, additive, redundant, and inhibitory effects when affective states were considered individually. The OO and LO data sets showed both convergent and divergent patterns; however, LO models yielded lower agreement but larger multimodal effects than OO models. Implications of the findings for automated affect detection are discussed.
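
To make the agreement statistic concrete, the following minimal sketch computes a chance-corrected kappa for two raters' categorical judgments. It assumes the reported kappas are Cohen's kappa, which the text above does not state explicitly. The function name `cohens_kappa` and the sample labels are hypothetical and are not drawn from the study's data.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters (assumed: Cohen's kappa).

    kappa = (p_observed - p_expected) / (1 - p_expected)
    """
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("Both raters must judge the same non-empty set of items.")

    n = len(ratings_a)

    # Proportion of items on which the two raters chose the same label.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Agreement expected by chance, from each rater's marginal label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        raise ValueError("Kappa is undefined when expected agreement equals 1.")

    return (p_observed - p_expected) / (1 - p_expected)


if __name__ == "__main__":
    # Hypothetical judgments from two observers over six video segments.
    observer_1 = ["boredom", "confusion", "neutral", "frustration", "boredom", "engagement/flow"]
    observer_2 = ["boredom", "neutral", "neutral", "confusion", "frustration", "engagement/flow"]
    print(round(cohens_kappa(observer_1, observer_2), 3))
```

A kappa of 0 indicates agreement no better than chance and 1 indicates perfect agreement, so values in the range reported here reflect agreement only slightly above chance.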