With the rapid spread of images across social media platforms, representation learning for social images has become an important research topic. A salient characteristic of social images is their multimodal nature: each image carries visual content and textual descriptions, and different images are linked by social relationships. Notably, the content information within individual images and the structural information among images form a hierarchy. However, existing approaches generally employ a flat framework, which makes suboptimal use of these hierarchical relationships. Furthermore, the heterogeneity of both the data content and the structural information poses additional challenges for social image representation. To address these challenges, we propose HHGSI, a novel Hierarchical Heterogeneous Graph Neural Network model for Social Image Representation learning. Our motivation is to explore and exploit the hierarchical relationships among diverse modalities by designing a hierarchical heterogeneous network framework. HHGSI consists of an Intra-node Multimodal Graph Encoder and an Inter-node Heterogeneous Graph Neural Network, which together capture fine-grained correlations within each image and heterogeneous relationships among images. Moreover, a task-independent optimization objective is designed to make the model suitable for a variety of network-oriented and multimodal tasks. We evaluate HHGSI extensively on four real-world datasets, and the experimental results demonstrate its superiority. Our code is publicly available at https://github.com/multimodal-code/HHGSI.