Abstract

Feature representation learning is a fundamental task in multimedia and pattern recognition research. In this paper, we propose a novel framework that explores the hierarchical structure inside images from the perspective of feature representation learning and applies it to hierarchical image annotation. In contrast to the current trend in multimedia analysis of using pre-defined features or focusing on flat representations for the end task, we propose a layer-wise tag-embedded deep learning (LTDL) model that learns hierarchical features corresponding to the semantic structure of the tag hierarchy. Unlike most existing deep learning models, LTDL exploits both the visual content of an image and the hierarchical information of its associated social tags. In the training stage, the two kinds of information are fused in a bottom-up way: supervised training and multi-modal fusion alternate layer by layer to learn the feature hierarchy. To validate the effectiveness of LTDL, we conduct extensive experiments on hierarchical image annotation over a large-scale public dataset. Experimental results show that the proposed LTDL learns representative features and achieves improved annotation performance.
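The abstract only sketches the training procedure, so below is a minimal PyTorch-style illustration of the layer-wise alternation it describes: each layer fuses visual features with tag embeddings, is trained with its own supervised head, and is then frozen before the next layer is stacked. All names (`LTDLLayer`, `train_layerwise`), the dimensions, the concatenation-based fusion, and the multi-label loss are illustrative assumptions for the sketch, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions -- the paper does not specify these.
VISUAL_DIM, TAG_DIM, HIDDEN_DIM, NUM_LABELS = 512, 128, 256, 50

class LTDLLayer(nn.Module):
    """One layer of the sketch: encode visual features, then fuse tag embeddings."""
    def __init__(self, in_dim, tag_dim, out_dim):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())
        # Assumed multi-modal fusion: project the concatenated visual + tag features.
        self.fuse = nn.Sequential(nn.Linear(out_dim + tag_dim, out_dim), nn.ReLU())

    def forward(self, visual, tags):
        h = self.encode(visual)
        return self.fuse(torch.cat([h, tags], dim=-1))

def train_layerwise(layers, heads, data, epochs=5, lr=1e-3):
    """Greedy bottom-up training: fit each layer with a supervised head
    (one level of the tag hierarchy) before stacking the next layer."""
    for depth, (layer, head) in enumerate(zip(layers, heads)):
        opt = torch.optim.Adam(
            list(layer.parameters()) + list(head.parameters()), lr=lr)
        loss_fn = nn.BCEWithLogitsLoss()  # multi-label annotation at this level
        for _ in range(epochs):
            for visual, tags, labels in data:
                # Lower layers are already trained and kept frozen here.
                with torch.no_grad():
                    for lower in layers[:depth]:
                        visual = lower(visual, tags)
                fused = layer(visual, tags)
                loss = loss_fn(head(fused), labels)
                opt.zero_grad()
                loss.backward()
                opt.step()

# Toy usage with random tensors standing in for image features and tag vectors.
layers = nn.ModuleList([LTDLLayer(VISUAL_DIM, TAG_DIM, HIDDEN_DIM),
                        LTDLLayer(HIDDEN_DIM, TAG_DIM, HIDDEN_DIM)])
heads = nn.ModuleList([nn.Linear(HIDDEN_DIM, NUM_LABELS) for _ in layers])
batch = [(torch.randn(8, VISUAL_DIM), torch.randn(8, TAG_DIM),
          torch.randint(0, 2, (8, NUM_LABELS)).float())]
train_layerwise(layers, heads, batch, epochs=1)
```

Freezing the lower layers during each stage mirrors the alternation the abstract describes: fusion happens bottom-up, and supervision is applied one level of the hierarchy at a time.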
