Abstract

The amount of image data available on the web is growing rapidly: on Facebook alone, 350 million new images are uploaded every day. Making sense of this data requires new ways of efficiently indexing, annotating, and querying such enormous collections. Research in computer vision has tackled this problem by developing algorithms for localizing and labeling objects in images. Object classification algorithms have recently been scaled up to work on thousands of object classes based on the ImageNet database.

The next frontier in analyzing images is to go beyond classifying objects: to truly understand a visual scene, we need to identify how the objects in that scene relate to each other, which actions and events they are involved in, and ultimately recognize the intentions of the actors depicted in the scene. The key to achieving this goal is to develop methods for parsing images into structured representations. A number of approaches have recently been proposed in the literature, including Visual Dependency Representations, Scene Graphs, and Scene Description Graphs. All of these models represent an image as a structured collection of objects, attributes, and relations between them.

In this presentation, we will focus on Visual Dependency Representations (VDRs), the only approach to image structure that is explicitly multimodal. VDRs start from the observation that images typically do not exist in isolation, but co-occur with textual data such as comments, captions, or tags; well-established techniques exist for extracting structure from such textual data. The VDR model exploits this observation by positing an image structure that links objects through geometric relations. Text accompanying the image can be parsed into a syntactic dependency graph, and the two representations are aligned, yielding a multimodal graph (see Figure 1). Synchronous parsing techniques from machine translation can be applied to this task, and the resulting VDRs are useful for image description and retrieval.

Parsing images into multimodal graph structures is an important step towards image understanding. However, for full understanding, representing the semantics of the image is also crucial. For example, the images in Figure 2 can all be described using the verb play (and are presumably assigned similar VDRs). However, a different meaning (verb sense) of play is evoked by each image. This has led to the new task of visual verb sense disambiguation: given an image and a verb, assign the correct sense of the verb, i.e., the one that corresponds to the action depicted in the image. We propose an unsupervised algorithm based on Lesk which performs visual sense disambiguation using textual, visual, and multimodal embeddings. In this presentation, we will discuss how the two tasks of VDR parsing and visual verb sense disambiguation can be combined to yield more complete syntactico-semantic image representations, which can then underpin applications such as image retrieval, image description, and visual question answering.
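
As an illustration of the kind of structure a VDR encodes, the sketch below builds a toy graph that links detected objects through coarse geometric relations computed from their bounding boxes. The object labels, boxes, and relation names are hypothetical stand-ins for the output of a real object detector and the actual VDR relation inventory; this is a minimal sketch of the underlying idea, not the parsing model itself.

```python
from itertools import combinations

def spatial_relation(box_a, box_b):
    """Return a coarse geometric relation between two boxes given as
    (x1, y1, x2, y2) in image coordinates (y grows downwards)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    if ay2 <= by1:                      # a ends before b starts vertically
        return "above"
    if by2 <= ay1:
        return "below"
    if ax2 <= bx1 or bx2 <= ax1:        # no horizontal overlap
        return "beside"
    return "overlaps"

def build_vdr(objects):
    """objects: dict mapping object label -> bounding box.
    Returns a list of (object, relation, object) edges."""
    return [(a, spatial_relation(box_a, box_b), b)
            for (a, box_a), (b, box_b) in combinations(objects.items(), 2)]

# Hypothetical detections for an image of a person kicking a ball.
detections = {
    "person": (50, 10, 150, 300),
    "ball":   (200, 320, 260, 380),
    "grass":  (0, 300, 400, 450),
}
print(build_vdr(detections))
# [('person', 'above', 'ball'), ('person', 'above', 'grass'), ('ball', 'overlaps', 'grass')]
```

A second sketch illustrates, under similarly hypothetical assumptions, the Lesk-style idea behind visual verb sense disambiguation: each candidate sense is represented by an embedding of its dictionary gloss, the query image (and any accompanying text) is encoded into the same space, and the sense with the highest similarity is selected. The sense labels and vectors below are toy values; in practice they would come from textual, visual, or multimodal encoders.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def disambiguate(query_embedding, sense_embeddings):
    """Pick the sense whose gloss embedding is closest to the query embedding."""
    return max(sense_embeddings,
               key=lambda sense: cosine(query_embedding, sense_embeddings[sense]))

# Toy 3-dimensional vectors standing in for real encoder outputs.
senses = {
    "play (take part in a game)":      np.array([0.9, 0.1, 0.0]),
    "play (perform on an instrument)": np.array([0.1, 0.9, 0.0]),
}
query = np.array([0.2, 0.8, 0.1])   # hypothetical embedding of image + caption
print(disambiguate(query, senses))  # -> "play (perform on an instrument)"
```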
