Abstract

Multimodal interfaces that enable natural interaction using multiple modalities such as touch, hand gestures, speech, and facial expressions represent a paradigm shift in human-computer interfaces. Their aim is to allow rich and intuitive multimodal interaction similar to human-to-human communication. From the multimodal system's perspective, apart from the input modalities themselves, context information such as states of attention and activity, and the identities of interacting users, can greatly improve the interaction experience. For example, when sensors such as cameras (webcams, depth sensors, etc.) and microphones are always on and continuously capturing signals in their environment, context information is very useful for distinguishing genuine system-directed activity from ambient speech and gesture activity in the surroundings, and for distinguishing the active user from among a set of users. Information about identity may be used to personalize the system's interface and behavior to suit the specific user, e.g., the look of the GUI, modality recognition profiles, and information layout. In this paper, we present a set of algorithms and an architecture that perform audiovisual analysis of context using sensors such as cameras and microphone arrays, integrating components for lip activity and audio direction detection (speech activity), face detection and tracking (attention), and face recognition (identity). The proposed architecture allows the component data flows to be managed and fused with low latency, low memory footprint, and low CPU load, since such a system is typically required to run continuously in the background and report events of attention, activity, and identity, in real time, to consuming applications.
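
To make the fusion idea concrete, the following is a minimal, hypothetical sketch of the kind of low-latency fusion loop the abstract describes: per-frame outputs from face detection/tracking, lip activity, audio-direction, and face-recognition components are combined into attention, activity, and identity events for consuming applications. All class names, thresholds, and the direction-gating rule below are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional
import time


@dataclass
class FrameObservation:
    """Per-frame outputs from the individual analysis components (assumed interface)."""
    timestamp: float
    face_box: Optional[tuple]              # (x, y, w, h) from face detection/tracking
    lip_activity: float                    # lip-motion score in [0, 1]
    audio_direction_deg: Optional[float]   # estimated speech direction, degrees
    face_identity: Optional[str]           # label from face recognition, if known


@dataclass
class ContextEvent:
    """Fused event reported to consuming applications."""
    timestamp: float
    kind: str       # "attention", "activity", or "identity"
    payload: dict


class ContextFusion:
    """Combines component outputs and publishes context events (illustrative only)."""

    def __init__(self, lip_threshold: float = 0.5, direction_tolerance_deg: float = 30.0):
        self.lip_threshold = lip_threshold
        self.direction_tolerance_deg = direction_tolerance_deg
        self.subscribers: List[Callable[[ContextEvent], None]] = []

    def subscribe(self, callback: Callable[[ContextEvent], None]) -> None:
        self.subscribers.append(callback)

    def _publish(self, event: ContextEvent) -> None:
        for callback in self.subscribers:
            callback(event)

    def process(self, obs: FrameObservation, camera_fov_deg: float = 60.0) -> None:
        # Attention: a tracked face is taken as evidence the user is facing the system.
        if obs.face_box is not None:
            self._publish(ContextEvent(obs.timestamp, "attention",
                                       {"face_box": obs.face_box}))

        # Activity: treat speech as system-directed only when lip motion is high
        # AND the audio arrives from roughly within the camera's field of view,
        # which filters out ambient speech elsewhere in the room.
        directed_audio = (obs.audio_direction_deg is not None and
                          abs(obs.audio_direction_deg)
                          <= camera_fov_deg / 2 + self.direction_tolerance_deg)
        if obs.lip_activity >= self.lip_threshold and directed_audio:
            self._publish(ContextEvent(obs.timestamp, "activity",
                                       {"lip_activity": obs.lip_activity,
                                        "audio_direction_deg": obs.audio_direction_deg}))

        # Identity: report the recognized user so the interface can personalize.
        if obs.face_identity is not None:
            self._publish(ContextEvent(obs.timestamp, "identity",
                                       {"user": obs.face_identity}))


if __name__ == "__main__":
    fusion = ContextFusion()
    fusion.subscribe(lambda e: print(f"{e.kind}: {e.payload}"))

    # One synthetic frame: a face in view, lips moving, speech arriving from ~10 degrees.
    fusion.process(FrameObservation(
        timestamp=time.time(),
        face_box=(120, 80, 64, 64),
        lip_activity=0.8,
        audio_direction_deg=10.0,
        face_identity="user_42",
    ))
```

In a continuously running system of this kind, the per-component processing would run on separate streams, with the fusion step kept lightweight so events can be delivered to subscribing applications with minimal latency.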
