Abstract

Multimedia, integrating different modalities such as text, image, and video, provides users with great convenience in the digital era. Researchers have been building multimedia infrastructure over recent decades, and nowadays multimedia content can be delivered to almost anyone, anywhere. With the rapid development of the media world, the multimedia research community has turned its attention to multimedia content analytics, which aims to recognise and represent semantic information from various data sources and content types. Vision and language are two representative content forms among the many multimedia formats. This dissertation investigates the interactions between the vision and language modalities to enhance the comprehension ability of multimedia content analytics methods.

The main challenges of multimedia content analytics arise from the feature representations of visual and textual content, the intrinsic modality gap between them, and the time-consuming training process. On the visual side, although a convolutional neural network based model can extract visual features that are effective for conventional computer vision tasks such as image classification, the learned representations have limitations when generalising to advanced visual comprehension tasks, including image captioning and visual dialogue. On the language side, language models learn word embeddings for textual content representation and generation. To generate high-quality text, an image caption for instance, the model must be trained on high-quality data. However, the quality of the training data cannot be guaranteed, and imperfect annotations inevitably lead to output of subpar quality. In addition, the modality transition between vision and language and the efficiency of model training are worth investigating to further enhance model usability.

To address these challenges, this dissertation concentrates on model effectiveness and efficiency. Firstly, depth maps and scene graphs are exploited to enhance the visual representations derived from the image. Chapter 2 introduces a depth-aware attention model for image paragraph captioning, in which a depth map is estimated to augment visual cues for more accurate, logical and diverse paragraph generation. Chapter 3 discovers object relationships for the visual dialogue model: the objects and their interactions are extracted from the image to form a scene graph, whose structure is preserved in a novel hierarchical graph convolutional network, so that the dialogue reasoning module can benefit from the comprehensive visual features extracted via this process. Secondly, the effectiveness of the language model is investigated in Chapter 4. A number of annotation quality issues are identified in image caption training data collected from an online crowd-sourcing platform, and a human-consensus loss is proposed to allow the model to learn from training data that includes imperfect annotations by placing a higher training priority on high-quality annotations. Thirdly, the modality gap between vision and language is explicitly addressed with a modality transition module in Chapter 5, which ensures a smooth transition from visual features to semantic embeddings for more precise and context-aware caption generation.
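As an illustration only, the sketch below shows one plausible form such a modality transition module could take: a small feed-forward projection that maps pooled visual features into the word-embedding space the caption decoder operates in. The class name, dimensions, and layer choices are assumptions for exposition and are not taken from the dissertation.

```python
import torch
import torch.nn as nn

class ModalityTransition(nn.Module):
    """Hypothetical sketch: project visual features into the semantic
    (word-embedding) space used by the caption decoder."""

    def __init__(self, visual_dim=2048, embed_dim=512):
        super().__init__()
        # A simple two-layer projection with normalisation; the module in
        # the dissertation may be structured quite differently.
        self.proj = nn.Sequential(
            nn.Linear(visual_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
            nn.LayerNorm(embed_dim),
        )

    def forward(self, visual_feats):
        # visual_feats: (batch, num_regions, visual_dim) CNN region features.
        return self.proj(visual_feats)  # (batch, num_regions, embed_dim)

# Usage: bridge image features into the decoder's embedding space.
regions = torch.randn(4, 36, 2048)        # dummy region features
semantic = ModalityTransition()(regions)  # -> (4, 36, 512)
```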
Lastly, Chapter 6 considers the training efficiency of the image captioning model. The training inefficiency is addressed with a well-engineered attention mechanism that can be trained in parallel, significantly reducing training time whilst maintaining competitive model performance.
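To make the efficiency argument concrete, the sketch below contrasts this idea with a recurrent decoder: a generic scaled dot-product self-attention layer computes every position of a caption in one batched matrix multiplication, so training parallelises across the sequence instead of stepping through tokens one at a time. This is a standard illustration of parallelisable attention, not the specific mechanism engineered in Chapter 6.

```python
import torch

def masked_self_attention(x):
    """Generic scaled dot-product self-attention over a whole caption.

    x: (batch, seq_len, dim). All positions are computed in one batched
    matrix multiplication, unlike a recurrent decoder that must process
    tokens sequentially.
    """
    d = x.size(-1)
    scores = x @ x.transpose(-2, -1) / d ** 0.5  # (batch, seq, seq)
    # Causal mask keeps the decoder autoregressive during training.
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))
    return scores.softmax(dim=-1) @ x            # (batch, seq, dim)

tokens = torch.randn(8, 20, 512)     # dummy caption token embeddings
out = masked_self_attention(tokens)  # all 20 positions in one pass
```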
