Recently, as the demand for non-face-to-face counseling has rapidly increased, the need for emotion recognition technology that combines various aspects such as text, voice, and facial expressions is being emphasized. In this paper, we address issues such as the dominance of non-Korean data and the imbalance of emotion labels in existing datasets like FER-2013, CK+, and AFEW by using Korean video data. We propose methods to enhance multimodal emotion recognition performance in videos by integrating the strengths of image modality with text modality. A pre-trained model is used to overcome the limitations caused by small training data. A GPT-4-based LLM model is applied to text, and a pre-trained model based on VGG-19 architecture is fine-tuned to facial expression images. The method of extracting representative emotions by combining the emotional results of each aspect extracted using a pre-trained model is as follows. Emotion information extracted from text was combined with facial expression changes in a video. If there was a sentiment mismatch between the text and the image, we applied a threshold that prioritized the text-based sentiment if it was deemed trustworthy. Additionally, as a result of adjusting representative emotions using emotion distribution information for each frame, performance was improved by 19% based on F1-Score compared to the existing method that used average emotion values for each frame.