Abstract
In this paper, we address fine-grained image–text alignment and cross-modal retrieval in the cultural heritage domain through two tasks: (1) given an image fragment of an artwork, retrieve the noun phrases that describe it; and (2) given a noun phrase describing an artifact attribute, retrieve the image fragment it specifies. To this end, we propose a weakly supervised alignment model in which the correspondence between the visual and textual training fragments is not known; instead, an image and a text that refer to the same artwork are treated as a positive pair. The model exploits the latent alignment between fragments across modalities with attention mechanisms, first projecting the fragments into a common semantic space and then training by increasing the image–text similarity of positive pairs in that space. During this process, we encode the model inputs with hierarchical encodings and remove irrelevant fragments with different indicator functions. We also study techniques for augmenting the limited training data with synthetic relevant textual fragments and transformed image fragments. The model is subsequently fine-tuned on a small set of aligned image–text fragment pairs. At test time, we rank image fragments and noun phrases by their intermodal similarity in the learned common space. Extensive experiments on two benchmark datasets demonstrate that our proposed models outperform two state-of-the-art methods adapted to fine-grained cross-modal retrieval of cultural items.
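To make the abstract's training scheme concrete, the sketch below shows one plausible reading of it: fragments from both modalities are projected into a common space, each noun phrase attends over image regions, and a triplet hinge objective pushes same-artwork pairs above mismatched ones. This is a minimal illustrative sketch, not the authors' implementation; the module and parameter names (`FragmentAligner`, `margin`, the feature dimensions) are hypothetical, the hinge loss is an assumption consistent with "increasing the image–text similarity of the positive pair", and the paper's hierarchical encodings and indicator functions are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FragmentAligner(nn.Module):
    """Hypothetical sketch: project visual and textual fragments into a
    shared space and score an image-text pair via attention over their
    fragment-level alignments."""

    def __init__(self, img_dim=2048, txt_dim=768, common_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, common_dim)  # image-fragment projection
        self.txt_proj = nn.Linear(txt_dim, common_dim)  # noun-phrase projection

    def pair_similarity(self, img_frags, txt_frags):
        # img_frags: (R, img_dim) region features; txt_frags: (P, txt_dim) phrase features
        v = F.normalize(self.img_proj(img_frags), dim=-1)  # (R, d)
        t = F.normalize(self.txt_proj(txt_frags), dim=-1)  # (P, d)
        sim = t @ v.t()                  # (P, R) fragment-level affinities
        attn = sim.softmax(dim=-1)       # each phrase attends over image regions
        attended = attn @ v              # (P, d) phrase-specific visual context
        # image-text score: mean cosine between phrases and their attended regions
        return (t * attended).sum(-1).mean()

def hinge_loss(model, pos_img, pos_txt, neg_img, neg_txt, margin=0.2):
    """Assumed triplet ranking objective: a same-artwork (positive) pair
    should outscore mismatched pairs by a margin."""
    s_pos = model.pair_similarity(pos_img, pos_txt)
    s_neg_t = model.pair_similarity(pos_img, neg_txt)  # wrong text for this image
    s_neg_v = model.pair_similarity(neg_img, pos_txt)  # wrong image for this text
    return torch.clamp(margin - s_pos + s_neg_t, min=0) + \
           torch.clamp(margin - s_pos + s_neg_v, min=0)
```

Under this reading, weak supervision enters through the sampling: positives are drawn from the same artwork's image and text, negatives from different artworks, with no fragment-level labels required during pre-training.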
Highlights
The cross-modal search of artwork plays an important role in facilitating the interaction between online art users and cultural objects
With the rapid progress of digitization, millions of cultural items have been featured on websites such as Europeana and the online source of the Metropolitan Museum of Art
We focus on the tasks of fine-grained image–text alignment and cross-modal retrieval in the cultural heritage domain
Summary
In this scenario, the cross-modal search of artwork plays an important role in facilitating the interaction between online art users and cultural objects. Cross-modal retrieval takes one type of data as the query to retrieve relevant data of another type. Previous works on the cross-modal retrieval of artwork items [1,2,3,4,5] focus on the coarse-grained full-image and full-text levels, while this work pushes cross-modal retrieval further to the fine-grained fragment level to make it easier for online art users to obtain detailed information on cultural objects. In addition to the benefits for online art users, our research could assist visitors in physical museums by retrieving the noun phrases related to a picture of an artwork fragment, and vice versa.
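At retrieval time, the two fragment-level tasks reduce to ranking candidates by similarity in the learned common space. The hypothetical sketch below illustrates task (2), ranking candidate image fragments for a query noun phrase; it reuses the illustrative `FragmentAligner` from the earlier sketch, and the function name and `top_k` parameter are assumptions for exposition only.

```python
import torch
import torch.nn.functional as F

def rank_fragments(model, phrase_feat, region_feats, top_k=5):
    """Rank candidate image fragments for one query noun phrase by
    cosine similarity in the learned common space (illustrative only)."""
    with torch.no_grad():
        t = F.normalize(model.txt_proj(phrase_feat), dim=-1)   # (d,) query phrase
        v = F.normalize(model.img_proj(region_feats), dim=-1)  # (N, d) candidates
        scores = v @ t                                         # (N,) cosine scores
        # return the best-matching fragments (scores and their indices)
        return scores.topk(min(top_k, scores.numel()))
```

Task (1), retrieving noun phrases for a query image fragment, is the symmetric case: project the region feature instead and rank candidate phrases by the same cosine score.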