Abstract

This paper describes a data collection process aimed at gathering human-computer dialogs in high-stress or "busy" domains where the user is concentrating on tasks other than the conversation, for example, when driving a car. Designing spoken dialog interfaces for such domains is extremely challenging, and the data collected will help us improve the dialog system interface and performance, understand how humans perform these tasks with respect to stressful situations, and obtain speech utterances for extracting prosodic features. This paper describes the experimental design for collecting speech data in a simulated driving environment.

1. Background

Research in human-computer interfaces has been carried out in applications where the user is focused on tasks such as driving a car [4] or operating other machinery, with the goal of designing interfaces that will help reduce the user's overall cognitive load. In such applications, the user normally controls several devices simultaneously. Existing applications maintain little or no dialog context and require the user to learn and remember complicated sets of device-specific commands. To overcome some of the shortcomings of such systems, researchers have been investigating spoken interface systems which can converse with the user more naturally, allowing more flexibility in the user's speech and keeping track of the dialog context, similar to how a human speech partner would [6, 7]. However, human-human speech in such scenarios is highly context- and situation-dependent, full of disfluencies (e.g., false starts and pauses) and sentence fragments (abandoned or repaired utterances), and is highly interactive and collaborative. We believe that the easiest interfaces to use will be those that mimic human-human interaction in some, though perhaps not all, respects. Therefore, our data collection focuses on gathering the kind of speech that would occur between a human and a system that is as flexible and capable as that user would desire.

Our goal is a system that mimics human-human interactions by understanding the user's requests and producing responses based on the user's knowledge, the conversational context, and the external situation. We use the car-driving domain as a testbed for such a dialog interface for operating in-car equipment, such as obtaining navigation information (e.g., turn-by-turn instructions) and information about local points of interest. Figure 1 illustrates the system components, which include a language understanding component, a response generator, a dialog manager, and a prosody classifier. We use off-the-shelf technologies and tools for speech recognition, speech synthesis, and knowledge management.
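As a rough illustration of how the Figure 1 components might interact, the Python sketch below wires a prosody classifier, language understanding, a dialog manager, and a response generator into a single turn-handling loop. All class names, method signatures, and return values here are hypothetical placeholders for illustration only, not the interfaces of our implementation; the off-the-shelf recognizer and synthesizer would sit in front of and behind this loop.

```python
# Minimal sketch of the dialog loop implied by Figure 1.
# All class and method names are hypothetical illustrations.

from dataclasses import dataclass, field


@dataclass
class DialogState:
    """Running conversational context shared by the components."""
    history: list = field(default_factory=list)   # prior semantic frames
    vehicle_position: tuple = (0.0, 0.0)          # external situation (e.g., GPS)


class ProsodyClassifier:
    def classify(self, audio_features: dict) -> dict:
        # Label timing, intonation, and loudness cues (e.g., stress, boundary).
        return {"stress": "low", "boundary": True}


class LanguageUnderstanding:
    def parse(self, utterance: str, prosody: dict, state: DialogState) -> dict:
        # Map recognized words plus prosodic cues to a semantic frame.
        return {"intent": "request_navigation", "text": utterance, "prosody": prosody}


class DialogManager:
    def next_action(self, frame: dict, state: DialogState) -> dict:
        # Decide what to do given the frame, dialog history, and situation.
        state.history.append(frame)
        return {"act": "give_instruction", "content": "turn left at the next light"}


class ResponseGenerator:
    def realize(self, action: dict) -> str:
        # Turn the dialog act into a sentence for the speech synthesizer.
        return f"Please {action['content']}."


def handle_turn(utterance: str, audio_features: dict, state: DialogState) -> str:
    prosody = ProsodyClassifier().classify(audio_features)
    frame = LanguageUnderstanding().parse(utterance, prosody, state)
    action = DialogManager().next_action(frame, state)
    return ResponseGenerator().realize(action)


if __name__ == "__main__":
    state = DialogState()
    print(handle_turn("how do I get to the gas station",
                      {"pitch": [], "energy": []}, state))
```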
1.1. Purposes for Data Collection

The ultimate goal of our dialog system is to enable natural interactions between the driver and the system, like those between humans. Collecting human-human dialogs for the above tasks therefore helps us develop and tune the system to simulate such interactions. As the first step of our system development, data collection has the following specific purposes.

Improve the system interface and performance: Language coverage has been a bottleneck for existing dialog systems. A robust dialog system should allow the user to speak freely and be able to understand the user's intention expressed through various utterances. The robustness of a system can only be enhanced using a large amount of data that is expected to cover most language phenomena in the target application. We therefore aim to collect dialogs from many subjects and to use these data to train the language understanding component. The data will also provide evidence as to what features users would desire in an in-car conversational system.

Understand how humans give navigation instructions in a driving situation: Although human navigation data have been collected for developing systems that automatically generate navigation instructions, e.g., [1], the data are often written descriptions based on the subject's mental recap of the route. In a driving environment, humans might choose to give navigation information differently with respect to the current position of the vehicle (e.g., close to a turn) and external situations (e.g., an emergency stop). There is therefore a need to collect new data to discover what kinds of strategies humans would use to convey navigation information in a real-time setting.

Obtain speech utterances for extracting prosody features: Drivers are likely to produce disfluent and distracted speech with potentially complex syntax when focusing on tasks other than talking. Such data contain rich prosodic information that captures variations in timing (e.g., lengthened sounds, pauses), intonation (e.g., pitch rise/fall at the end of an utterance), and loudness. These features convey information beyond that carried by the words themselves. They can help a dialog system detect utterance boundaries, driver intention, and stress level, and subsequently generate appropriate responses that take into account the driver's emotional state. They can also help augment the information available to a natural language parser.
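As a rough illustration of the prosodic measurements mentioned above (timing, intonation, and loudness), the following sketch extracts pitch, energy, and pause features from a single recorded utterance using the open-source librosa library. The F0 range, energy threshold, frame sizes, and file name are illustrative assumptions; this is a minimal example, not the feature extraction pipeline used in our system.

```python
# Illustrative prosodic feature extraction (not our actual pipeline).
# Assumes a mono WAV file containing a single utterance.

import numpy as np
import librosa

HOP = 512      # hop length in samples
FRAME = 2048   # analysis window in samples


def prosodic_features(wav_path: str) -> dict:
    y, sr = librosa.load(wav_path, sr=16000)

    # Intonation: fundamental frequency (F0) contour via the pYIN tracker.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=65.0, fmax=400.0, sr=sr, frame_length=FRAME, hop_length=HOP
    )

    # Loudness: short-time RMS energy per frame.
    rms = librosa.feature.rms(y=y, frame_length=FRAME, hop_length=HOP)[0]

    # Both analyses use the same hop, so frame counts match up to rounding.
    n = min(len(rms), len(voiced_flag))
    rms, voiced_flag, f0 = rms[:n], voiced_flag[:n], f0[:n]

    # Timing: count low-energy, unvoiced frames as pause time.
    frame_dur = HOP / sr
    pause_seconds = float(((rms < 0.01) & ~voiced_flag).sum() * frame_dur)

    voiced_f0 = f0[~np.isnan(f0)]
    return {
        "f0_mean_hz": float(voiced_f0.mean()) if voiced_f0.size else 0.0,
        # Pitch trend over the last few voiced frames (rise vs. fall).
        "f0_final_slope": float(np.mean(np.diff(voiced_f0[-10:])))
        if voiced_f0.size > 1 else 0.0,
        "rms_mean": float(rms.mean()),
        "pause_seconds": pause_seconds,
        "utterance_seconds": float(len(y) / sr),
    }


if __name__ == "__main__":
    # Hypothetical file name for a recording from the driving simulator.
    print(prosodic_features("subject01_turn_request.wav"))
```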
