Abstract
This paper proposes a visual word embedding approach to capturing semantic relatedness between visual words. Specifically, visual words are extracted from a word image collection under the Bag-of-Visual-Words framework, and a deep learning procedure then maps each visual word to an embedding vector in a semantic space. To integrate spatial constraints into the representation of word images, each word image is segmented into several equal-size sub-regions along rows and columns. Each sub-region is then represented by the average of its embedding vectors, i.e., the centroid of the embedding vectors of all visual words falling within that sub-region. In this way, a word image is converted into a fixed-length vector by concatenating the average embedding vectors of all its sub-regions, and the similarity between two word images can be measured by the Euclidean distance between their vectors. Experimental results demonstrate that the proposed representation outperforms Bag-of-Visual-Words, the visual language model, spatial pyramid matching, latent Dirichlet allocation, average visual word embeddings, and a recurrent neural network.
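The following is a minimal sketch of the representation step described above: visual-word embeddings are pooled into per-sub-region centroids over an equal-size grid and concatenated into a fixed-length descriptor, with Euclidean distance used for comparison. The function names, grid size, and embedding dimension are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def grid_descriptor(keypoints, embeddings, image_w, image_h, rows=2, cols=2):
    """Hypothetical helper: build a fixed-length word-image descriptor.

    keypoints:  (N, 2) array of visual-word (x, y) positions in the image.
    embeddings: (N, D) array of the corresponding embedding vectors.
    """
    dim = embeddings.shape[1]
    cells = [[[] for _ in range(cols)] for _ in range(rows)]
    # Assign each visual word to the grid cell containing its keypoint.
    for (x, y), vec in zip(keypoints, embeddings):
        r = min(int(y / image_h * rows), rows - 1)
        c = min(int(x / image_w * cols), cols - 1)
        cells[r][c].append(vec)
    # Average (centroid) per sub-region; empty cells fall back to zeros.
    parts = []
    for r in range(rows):
        for c in range(cols):
            if cells[r][c]:
                parts.append(np.mean(cells[r][c], axis=0))
            else:
                parts.append(np.zeros(dim))
    # Concatenate the per-cell centroids into one fixed-length vector.
    return np.concatenate(parts)

def word_image_distance(desc_a, desc_b):
    # Euclidean distance between two word-image descriptors.
    return np.linalg.norm(desc_a - desc_b)
```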