Social Scene Understanding from Social Cameras

Hyun Soo Park

doi:10.1184/r1/6722927.v1

Abstract

In social scenes, humans interact with each other by sending visible social signals, such as facial expressions, body gestures, and gaze movements. Social cognition, the ability to perceive, model, and predict such social signals, enables people to understand social interactions and to plan their behavior in accordance with that understanding. Computational social cognition is a necessary function for artificial agents to enter social spaces because it enables socially acceptable behavior. However, two key challenges hinder the development of computational social cognition: (1) the core attributes of social cognition, such as attention, emotion, and intent, are latent quantities that cannot be directly measured by existing sensors; (2) social behaviors are interdependent, so a unified representation is required to understand social behavior as a whole. In this thesis, we address these challenges by establishing a computational foundation towards social scene understanding from social cameras. A social camera is a camera held or worn by a member of a social group that inherits that person's gaze behavior. This social camera is an ideal sensor to capture social signals for three reasons: (1) social cameras naturally secure the best view because the wearers or holders intelligently localize the best viewpoint to attend to what they find interesting; (2) social cameras produce more views of events of greater interest; (3) social cameras efficiently capture socially important events by following social behaviors when the scenes are dynamic. We leverage these advantages of social cameras to understand social scenes. We present a framework to develop social cognition by perceiving social signals, modeling the relationship between them, and predicting social behaviors.
Social Signal Reconstruction: Reconstructing social signals in a unified 3D coordinate system provides a computational basis to analyze social scenes, e.g., to build a model, reason about relationships, and predict social behaviors. We leverage social cameras to reconstruct three types of social signals: gaze movement, body motion, and general scene motion. (1) Gaze is a strong indicator of attentive behaviors. We model the gaze using the primary gaze direction that is emitted from the center of the eyes and aligned with the head orientation. This gaze model is reconstructed in 3D by leveraging ego- and exo-motion of social cameras. (2) Human body motion such as gestures often conveys the intent of social interactions. We model skeletal motion using a set of articulated joint trajectories where the distance between the trajectories of adjacent joints remains constant. This articulation constraint, in conjunction with a temporal constraint, is applied to reconstruct human body motion without an activity-specific prior. (3) We further relax the articulation constraint to model general scene motion occurring in social interactions. We represent a 3D trajectory using a linear combination of predefined trajectory basis vectors. We solve for the parameters of each trajectory by formulating it as a linear least squares system that allows us to reconstruct topology-independent motion and handle missing data.
Social Behavior Understanding: Social behaviors are interactive by definition; therefore, analyzing individual behaviors in isolation cannot fully account for the fundamental relationship between behaviors. For instance, a social signal transmitted by one person can trigger responses in others, and those responses can, in turn, affect the behavior of that person. A relational analysis between the signals is needed to characterize the social interactions. We exploit the reconstructed social signals in a unified coordinate system to understand the relationship between them.
In particular, our analysis focuses on joint attention, the primary social attribute that is strongly correlated with attentive behaviors. We present a method to reconstruct 3D joint attention modeled by social charges, latent quantities that form at locations where the primary gaze directions of members in a social group intersect. Inspired by the study of electric fields, we model the relationship between gaze behaviors using a gradient field induced by the social charges. This gradient field allows us to predict gaze behaviors given social charges at any location in the scene.
Our overarching goal is to develop computational social cognition that will enable artificial agents to accomplish their tasks in a socially acceptable way. This thesis takes a first step toward this goal by leveraging social cameras. We present a 3D representation of social signals and, based on the reconstructed signals, build a relational model of social behaviors that allows us to predict those behaviors. We apply our frameworks in real-world social scenes including sporting events, meetings, and parties.
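As a rough illustration of the idea that joint attention forms where primary gaze directions intersect, the sketch below computes the single 3D point that is closest, in the least-squares sense, to a set of gaze lines. This is a generic geometric construction and not necessarily the method used in the thesis; the function name `nearest_point_to_rays`, the input layout, and the simplification of treating each gaze ray as an infinite line are all assumptions made for this example.

```python
import numpy as np


def nearest_point_to_rays(origins, directions):
    """Return the 3D point minimizing the summed squared distance to a set of lines.

    Hypothetical helper for illustration only. Each gaze ray is treated as an
    infinite line through origins[i] along directions[i] (an assumption).
    """
    origins = np.asarray(origins, dtype=float)        # shape (n, 3)
    directions = np.asarray(directions, dtype=float)  # shape (n, 3)

    # Unit direction vectors, one per line.
    d = directions / np.linalg.norm(directions, axis=1, keepdims=True)

    # P_i = I - d_i d_i^T projects onto the plane orthogonal to line i,
    # so ||P_i (x - o_i)|| is the distance from x to line i.
    P = np.eye(3)[None, :, :] - d[:, :, None] * d[:, None, :]

    # Normal equations of the least-squares problem: (sum P_i) x = sum P_i o_i.
    A = P.sum(axis=0)
    b = np.einsum("nij,nj->i", P, origins)

    # A is invertible as long as the lines are not all parallel.
    return np.linalg.solve(A, b)
```

With several eye positions as `origins` and their primary gaze directions as `directions`, the returned point is one plausible candidate for a shared point of attention. A scene with multiple simultaneous attention points would need a clustering or mode-seeking step that this sketch does not include.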
