Visual emotion classification aims to predict people's emotional reactions to given visual content. Psychological studies show that human emotions are affected by visual stimuli ranging from low level to high level, including contrast, color, texture, scene, object, and association, among others. Traditional approaches regarded different levels of stimuli as independent components and failed to fuse them effectively. This article proposes a hierarchical convolutional neural network (CNN)-recurrent neural network (RNN) approach that predicts emotion from the fused stimuli by exploiting the dependency among different-level features. First, we introduce a dual CNN to extract different levels of visual stimuli, where two related loss functions are designed to learn the stimulus representations under a multi-task learning structure. Then, to model the dependency between the low- and high-level stimuli, a stacked bi-directional RNN is proposed to fuse the features previously learned by the dual CNN. Comparison experiments on one large-scale and three small-scale datasets show that the proposed approach brings significant improvement, and ablation experiments demonstrate the effectiveness of the different modules of our model.
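As a rough illustration of the fused-stimuli idea, the sketch below (in PyTorch) shows a dual-branch CNN whose low- and high-level features are treated as a two-step sequence and fused by a stacked bidirectional RNN, with two auxiliary heads providing the multi-task losses. The module name `DualCNNBiRNN`, the layer sizes, and the use of LSTM cells are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DualCNNBiRNN(nn.Module):
    """Hedged sketch: two CNN branches extract low- and high-level stimulus
    features, a stacked bidirectional LSTM fuses them, and two auxiliary
    heads supply the multi-task losses. All sizes are assumptions."""

    def __init__(self, feat_dim=256, num_emotions=8):
        super().__init__()
        # Low-level branch (e.g., color/texture cues): shallow convolutions.
        self.low_cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # High-level branch (e.g., object/scene cues): deeper convolutions.
        self.high_cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Auxiliary heads: one per branch, giving the two related losses.
        self.low_head = nn.Linear(feat_dim, num_emotions)
        self.high_head = nn.Linear(feat_dim, num_emotions)
        # Stacked bidirectional RNN over the (low, high) feature sequence.
        self.birnn = nn.LSTM(feat_dim, feat_dim, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.fusion_head = nn.Linear(2 * feat_dim, num_emotions)

    def forward(self, images):
        low = self.low_cnn(images)    # (B, feat_dim)
        high = self.high_cnn(images)  # (B, feat_dim)
        # Treat the two stimulus levels as a length-2 sequence so the
        # bidirectional RNN can model their mutual dependency.
        seq = torch.stack([low, high], dim=1)    # (B, 2, feat_dim)
        fused, _ = self.birnn(seq)               # (B, 2, 2*feat_dim)
        logits = self.fusion_head(fused[:, -1])  # fused emotion prediction
        return logits, self.low_head(low), self.high_head(high)
```

In such a setup, training would presumably minimize a weighted sum of the fused-prediction loss and the two auxiliary branch losses, e.g. a cross-entropy term for each of the three outputs; the exact weighting is an assumption here.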