Abstract

Human affect can be sensed from a broad range of behavioral cues and signals that are conveyed via visual, acoustic, and tactual expressions or presentations of emotions. Affective states can thus be recognized from visible/external signals such as gestures (e.g., facial expressions, body gestures, head movements) and speech (e.g., parameters such as pitch, energy, frequency, and duration), or from invisible/internal signals such as physiological signals (e.g., heart rate, skin conductivity, salivation), brain and scalp signals, and thermal infrared imagery. Despite the broad range of cues and modalities available in human-human interaction (HHI), mainstream emotion research has mostly focused on facial expressions (Hadjikhani & De Gelder, 2003). In line with this, most past research on affect sensing and recognition has also focused on facial expressions and on data that has been posed on demand or acquired in laboratory settings. Additionally, each sense, such as vision, hearing, and touch, has been considered in isolation. However, natural human-human interaction is multimodal and does not occur in predetermined, restricted, and controlled settings. In the day-to-day world, people do not present themselves to others as voice- or body-less faces or face- or body-less voices (Walker-Andrews, 1997). Moreover, the available emotional signals, such as facial expression, head movement, hand gestures, and voice, are unified in space and time (see Figure 1): they inherently share the same spatial location, and their occurrences are temporally synchronized. Cognitive neuroscience research thus claims that information coming from various modalities is combined in our brains to yield multimodally determined percepts (Driver & Spence, 2000). In real-life situations, our different senses receive correlated information about the same external event. When assessing each other's emotional or affective state, we are capable of handling significantly variable conditions in terms of viewpoint (i.e., frontal, profile, even back view), tilt angle, distance (i.e., face to face as well as at a distance), illumination (i.e., both day and night conditions), occlusion (e.g., even when some body parts are occluded), motion (e.g., both when stationary and moving, walking and talking), and noise (e.g., while many people are chatting and interacting simultaneously).
