Designing an efficient Intangible Cultural Heritage (ICH) image classification method helps the public recognize ICH and fosters its preservation and dissemination. Current ICH image classification research focuses mainly on the visual features of ICH images, ignoring the attached textual descriptions. However, these descriptions can provide crucial clues for ICH image classification. Therefore, in this study, we propose to combine attached textual descriptions to perform ICH image classification in a multimodal way. Additionally, to capture the intra- and inter-modal interactions between ICH images and their attached textual descriptions, we propose a novel model named MICMLF, consisting mainly of multimodal attention and hierarchical fusion. Multimodal attention makes the model focus on "important regions" of ICH images and "important words" in the attached textual descriptions, while hierarchical fusion captures dynamic inter-modal interactions. Extensive experiments are conducted on two datasets drawn from China's national-level ICH lists, New Year Print (年画) and Clay Figurine (泥塑). Experimental results demonstrate the superiority of MICMLF over several state-of-the-art methods. Moreover, the proposed model can handle situations where ICH images or textual descriptions are incomplete. To the best of our knowledge, this is the first work to combine textual descriptions to perform ICH image classification in a multimodal way.