Refer-iTTS: A System for Referring in Spoken Installments to Objects in Real-World Images

Sina Zarrieß, David Schlangen, M Soledad López Gambino

doi:10.18653/v1/w17-3509

Abstract

Current referring expression generation systems mostly deliver their output as one-shot, written expressions. We present ongoing work on incremental generation of spoken expressions referring to objects in real-world images. This approach extends previous work using the words-as-classifiers model for generation. We implement this generator in an incremental dialogue processing framework so that we can exploit an existing interface to incremental text-to-speech synthesis. Our system generates and synthesizes referring expressions while continuously observing non-verbal user reactions.

Highlights

We present Refer-iTTS, a system that is meant to support research on real-time spoken referring expression generation (REG) and builds upon recent approaches to REG from real-world images (Kazemzadeh et al., 2014; Zarrieß and Schlangen, 2016).
While generating and synthesizing the referring expression (RE), the system continuously observes the user's non-verbal reactions and incrementally adapts the generated utterances to them.
The system tries to be as cooperative as possible: if the user shows no reaction for a certain amount of time, the previous expression is expanded. That is, the system splits its referring expression over several utterances, a strategy usually known as “reference in installments” (cf. Zarrieß and Schlangen, 2016).
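
The installment behaviour described in the last highlight can be illustrated with a minimal sketch. It is not the Refer-iTTS implementation: the function `refer_in_installments`, the callbacks `speak` and `poll_reaction`, the timeout value, and the example phrases are all hypothetical names and values chosen for illustration. The reaction source is simulated with a fixed sequence so the example runs without a speech synthesizer or a real user.

```python
from typing import Callable, List, Optional


def refer_in_installments(
    installments: List[str],
    speak: Callable[[str], None],
    poll_reaction: Callable[[float], Optional[str]],
    timeout_s: float = 2.0,  # assumed value; the source only says "a certain amount of time"
) -> bool:
    """Utter installments one at a time until the user reacts.

    Returns True if a reaction was observed, False if every
    installment was spoken without any reaction.
    """
    for chunk in installments:
        speak(chunk)
        # poll_reaction returns None when nothing happened within the timeout.
        reaction = poll_reaction(timeout_s)
        if reaction is not None:
            return True
    return False


# Simulated run: the user ignores the first two installments,
# then reacts after the third one.
spoken: List[str] = []
simulated_reactions = iter([None, None, "moved_cursor"])

reacted = refer_in_installments(
    ["the dog", "on the left", "next to the bench", "the brown one"],
    speak=spoken.append,
    poll_reaction=lambda timeout_s: next(simulated_reactions),
)
```

In this run the fourth installment is never spoken, because the simulated reaction arrives after the third. A real system would additionally have to interpret the reaction (for example, whether the user is moving towards the intended object), which this sketch leaves out.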

Summary

Introduction

We present Refer-iTTS, a system that is meant to support research on real-time spoken REG and builds upon recent approaches to REG from real-world images (Kazemzadeh et al., 2014; Zarrieß and Schlangen, 2016). We use the recently proposed words-as-classifiers (WAC) model for generation from low-level visual inputs and integrate it with InproTk (Baumann and Schlangen, 2012b), an open-source framework for incremental dialogue processing.
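
To make the words-as-classifiers idea concrete, the sketch below treats each word as a small classifier that scores how well the word fits an object's visual features, and ranks the vocabulary by that score. This is an illustrative toy, not the model used in Refer-iTTS: the logistic form, the two hand-picked features ("redness", "leftness"), the weights, and the names `WordClassifier` and `rank_words` are all assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence


class WordClassifier:
    """One classifier per word: maps object features to a fit score in (0, 1).

    The logistic form is an assumption made for this sketch.
    """

    def __init__(self, weights: Sequence[float], bias: float) -> None:
        self.weights = list(weights)
        self.bias = bias

    def score(self, features: Sequence[float]) -> float:
        z = sum(w * x for w, x in zip(self.weights, features)) + self.bias
        return 1.0 / (1.0 + math.exp(-z))


def rank_words(lexicon: Dict[str, WordClassifier],
               features: Sequence[float]) -> List[str]:
    """Return the words ordered from best to worst fit for the object."""
    return sorted(lexicon, key=lambda w: lexicon[w].score(features), reverse=True)


# Hypothetical toy lexicon over two features: [redness, leftness].
lexicon = {
    "red": WordClassifier([4.0, 0.0], -2.0),
    "left": WordClassifier([0.0, 4.0], -2.0),
    "green": WordClassifier([-4.0, 0.0], 2.0),
}

# A target object that is very red and located towards the right.
target_features = [0.9, 0.2]
ranking = rank_words(lexicon, target_features)
```

For this target, "red" scores highest, so a generator built on such scores would prefer it when choosing words for the expression. How the actual system selects and orders words is not specified in this summary.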

Results

Conclusion