Conventional studies of museum visitor satisfaction rely on surveys and support only a one-way service to visitors, so they cannot capture the real-time satisfaction of visitors while they are experiencing an exhibition program. In addition, museum practitioners lack research on automated ways to evaluate the lifecycle and appropriateness of a produced content program. To overcome these problems, we propose a novel multi-convolutional neural network, called VimoNet, which automatically recognizes visitors' emotions in real time from their facial expressions and body gestures. Furthermore, we design a user preference model for content and a framework for obtaining feedback on content improvement, with the aim of providing personalized digital cultural heritage content to visitors. Specifically, we define seven visitor emotions and build a dataset of visitor facial expressions and gestures labeled with these emotions. Using this dataset, we fuse the features of face and gesture images learned with the DenseNet-201 and VGG-16 models to generate a combined emotion recognition model. In our experiments, VimoNet achieved a classification accuracy of 84.10%, an improvement of 7.60% and 14.31% over emotion classification based on the face alone and on body gestures alone, respectively. VimoNet thus makes it possible to capture the emotions of museum visitors automatically, and we confirm its feasibility through a case study on digital cultural heritage content.
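
To make the feature-fusion step concrete, the following is a minimal sketch and not the VimoNet implementation. It assumes that fusion is done by concatenating one feature vector per modality and that a single linear layer with softmax scores the seven emotions; the abstract does not specify either choice. The feature vectors, their sizes, the weights, and the function names (`fuse_features`, `classify`) are hypothetical placeholders standing in for the outputs of the face and gesture branches.

```python
import math
import random

NUM_EMOTIONS = 7  # the abstract defines seven visitor emotions


def fuse_features(face_feat, gesture_feat):
    """Feature-level fusion by concatenation (assumed fusion operator)."""
    return list(face_feat) + list(gesture_feat)


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused, weights, biases):
    """Linear layer followed by softmax (assumed classifier head)."""
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


# Placeholder inputs: random vectors stand in for the features that the
# face and gesture branches would produce. Sizes 8 and 4 are arbitrary.
rng = random.Random(0)
face_feat = [rng.random() for _ in range(8)]
gesture_feat = [rng.random() for _ in range(4)]

fused = fuse_features(face_feat, gesture_feat)

# Untrained, randomly initialised weights, for illustration only.
weights = [[rng.uniform(-1.0, 1.0) for _ in fused] for _ in range(NUM_EMOTIONS)]
biases = [0.0] * NUM_EMOTIONS

probs = classify(fused, weights, biases)
predicted = max(range(NUM_EMOTIONS), key=probs.__getitem__)
print(len(fused), predicted)
```

In a trained system the placeholder vectors would be replaced by features extracted from each network, and the classifier weights would be learned on the labeled dataset.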