With the development of social media, data are becoming increasingly diverse, spanning images, texts, and videos, which makes it possible to use multimodal information to predict popularity. To effectively predict the popularity of specific posts in social networks, we propose a social media popularity prediction framework based on a hierarchical fusion model. We first extract features for four kinds of information across two modalities: images, image attributes, texts, and text attributes. Specifically, we extract image features with Residual Networks (ResNet) and generate image attribute features with an image attribute predictor. We obtain text features by combining the image attribute features, Global Vectors for Word Representation (GloVe) embeddings, and a bidirectional Long Short-Term Memory (LSTM) model, while text attribute features are added during data preprocessing. The features of the four kinds of information are then reconstructed with our multimodal hierarchical fusion model and merged into a single feature vector for popularity prediction. Experimental results on two real-world datasets show that our hierarchical fusion method achieves excellent performance for popularity prediction.
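The pipeline described above can be illustrated with a minimal sketch. All dimensions, the random linear maps, and the two-level fusion layout below are assumptions for illustration only; the paper's actual fusion model and feature sizes are not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-source feature vectors (dimensions are illustrative,
# not taken from the paper): ResNet image features, image-attribute
# features, BiLSTM-encoded text features, and text-attribute features.
img_feat = rng.standard_normal(2048)   # stand-in for ResNet pooled features
img_attr = rng.standard_normal(300)    # stand-in for image attribute embedding
txt_feat = rng.standard_normal(512)    # stand-in for BiLSTM sentence encoding
txt_attr = rng.standard_normal(32)     # stand-in for preprocessed text attributes

def fuse(a, b, out_dim, seed):
    """Toy fusion step: project the concatenation of two feature
    vectors into a shared space with a fixed random linear map."""
    w_rng = np.random.default_rng(seed)
    in_dim = a.size + b.size
    w = w_rng.standard_normal((out_dim, in_dim)) / np.sqrt(in_dim)
    return np.tanh(w @ np.concatenate([a, b]))

# First level: fuse each modality's content features with its
# attribute features.
image_level = fuse(img_feat, img_attr, out_dim=256, seed=1)
text_level = fuse(txt_feat, txt_attr, out_dim=256, seed=2)

# Second level: merge the two modality-level vectors into one feature
# vector that a downstream regressor would map to a popularity score.
fused = fuse(image_level, text_level, out_dim=128, seed=3)
print(fused.shape)  # (128,)
```

The two-stage structure (source-with-attribute fusion first, then cross-modality fusion) mirrors the hierarchical layout the abstract describes; in practice the random projections would be replaced by learned layers.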