Multimodal interfaces aim to permit natural communication by speech and gesture. Typically, the speech modality carries the principal information in the interaction, with gesture complementing the spoken commands. A continuing challenge is how to correlate and interpret these simultaneous inputs to estimate meaning and user intent. User expertise and familiarity figure prominently in the interpretation. The present research studies the effect of user expertise on multimodal human-computer interaction. Users are classified as experienced or inexperienced according to the extent of their prior exposure to and interaction with multimodal systems. Each user is asked to perform simple tasks using a multimodal system. For each task, the automatically recognized speech input is time-stamped, and the lag or lead of the gesture input is computed with respect to this time stamp. The time interval around the time stamp within which all of the users' gesture inputs occur is then determined. For experienced users this interval averages 56.9% less than that for inexperienced users. The implication is that, for experienced users, the spoken input and the corresponding gesture input are more closely related in time than for inexperienced users. This behavior can be exploited in multimodal systems to increase efficiency and reduce system response time.
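The measurement the abstract describes can be illustrated with a minimal sketch. The sketch below assumes each task yields a speech time stamp and a gesture onset time, computes the gesture's lead or lag relative to the speech, and takes the width of the smallest interval containing all offsets as the group's interval; the data, function names, and the min-to-max reading of "interval around the time stamp" are all assumptions for illustration, not the study's actual method or results.

```python
# Illustrative sketch only: event data and the interval definition are
# assumptions, not taken from the study.

def gesture_offsets(events):
    """Lead/lag of each gesture relative to its speech time stamp (seconds).

    Negative offsets mean the gesture led the speech; positive means it lagged.
    `events` is a list of (speech_time, gesture_time) pairs.
    """
    return [gesture_t - speech_t for speech_t, gesture_t in events]

def containing_interval(offsets):
    """Width of the smallest interval containing every observed offset
    (one plausible reading of 'interval around the time stamp')."""
    return max(offsets) - min(offsets)

# Hypothetical data: experienced users cluster gestures near the speech.
experienced = [(10.0, 9.8), (22.5, 22.7), (31.0, 31.1)]
inexperienced = [(12.0, 10.9), (25.0, 26.4), (40.0, 38.7)]

w_exp = containing_interval(gesture_offsets(experienced))
w_inexp = containing_interval(gesture_offsets(inexperienced))
print(f"experienced interval:   {w_exp:.2f} s")    # 0.40 s
print(f"inexperienced interval: {w_inexp:.2f} s")  # 2.70 s
print(f"reduction: {100 * (1 - w_exp / w_inexp):.1f}%")
```

A system could use such a per-group interval as an adaptive fusion window: the tighter the window, the sooner speech and gesture inputs can be paired and interpreted, which is the efficiency gain the abstract points to.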