Dialogue generation is an important research direction in natural language generation, and generating utterances with contextually appropriate emotion is a challenging task. Previous work incorporates commonsense knowledge as an auxiliary information source by simply concatenating it with, or adding it to, the embeddings of the dialogue context. Such work neither updates the commonsense knowledge using the semantic information of the context nor selects context using the semantic information of the commonsense knowledge, and it does not extract emotional information from the dialogue context and the commonsense knowledge to enhance the embeddings of the dialogue text. These defects can leave the model unable to judge the semantic consistency between the context and the generated utterance, producing utterances that are unrelated to the dialogue context and emotions that do not match the context.

Our goal is to build an emotional dialogue generation model that deeply integrates commonsense knowledge with the dialogue context and achieves bidirectional semantic interaction and enhancement. In this paper, we therefore propose CoMaSa, a Context Multi-aware Self-attention emotion dialogue generation model that jointly models the semantic and emotional information of both the context and the external knowledge. It effectively alleviates the problems of generating utterances unrelated to the context and of disordered emotional expression in the generated utterances.

First, we adopt the Transformer encoder as the basic framework and construct a context-aware encoder that deeply extracts semantic information. Specifically, we integrate positional information at different levels to simulate the natural flow of conversation between speakers, fuse this positional information with the basic word embeddings to enrich the dialogue representation, and apply an attention mechanism so that the model can precisely capture the context information most relevant to the current turn and achieve a deep understanding of the dialogue content (see the first sketch at the end of this section).

Second, we consider that the relationship between context and commonsense knowledge is not a unidirectional selection but a closer bidirectional interaction. We therefore design a bidirectional interaction attention component for the context and the external knowledge. Specifically, we use iterative similarity matrices to calculate the bidirectional similarity between context and external knowledge, apply attention to compute the connectivity from context to commonsense knowledge and from commonsense knowledge to context, and combine the semantic embeddings of both directions to update the context and the external knowledge bidirectionally (see the second sketch at the end of this section).

Third, we establish an emotion-aware decoder based on multi-task learning, which comprises an emotion classifier and an utterance generator. The emotion classifier introduces an auxiliary emotion recognition task that helps the model follow emotional transitions, and its output serves as an additional supervisory signal for controlling the emotion of the generated response. The utterance generator fuses the output of the emotion classifier with the interaction vector produced by the bidirectional interaction attention component (see the third sketch at the end of this section).
This fusion allows the model to generate utterances that align with the emotional context. Experiments on the DailyDialog and ESConv datasets show that CoMaSa outperforms the baselines in terms of Perplexity, Distinct-1, Distinct-2, and human evaluation, demonstrating the effectiveness of our Context Multi-aware Self-attention model for emotional dialogue generation.
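To make the three components concrete, we give minimal PyTorch sketches of one plausible realization of each. The first illustrates the context-aware encoder: token-level and turn-level positional embeddings (the two assumed levels of positional information) are fused with the word embeddings before self-attention. All module names, dimensions, and hyperparameters here are illustrative assumptions, not the model's actual implementation.

```python
# A minimal sketch of the context-aware encoder, assuming two levels of
# positional information: token position within the dialogue and turn
# position (to simulate the flow of conversation between speakers).
import torch
import torch.nn as nn

class ContextAwareEncoder(nn.Module):
    def __init__(self, vocab_size=30000, d_model=512, n_heads=8,
                 n_layers=4, max_tokens=512, max_turns=32):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # basic word embeddings
        self.pos_emb = nn.Embedding(max_tokens, d_model)   # token-level positions
        self.turn_emb = nn.Embedding(max_turns, d_model)   # turn-level positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, token_ids, turn_ids):
        # token_ids, turn_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # Fuse the word embeddings with both levels of positional information.
        x = self.tok_emb(token_ids) + self.pos_emb(positions) + self.turn_emb(turn_ids)
        # Self-attention extracts context information relevant to the current turn.
        return self.encoder(x)

# Usage with toy inputs: two dialogues of 20 tokens spread over 5 turns.
enc = ContextAwareEncoder()
tokens = torch.randint(0, 30000, (2, 20))
turns = torch.randint(0, 5, (2, 20))
ctx = enc(tokens, turns)  # (2, 20, 512)
```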
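The second sketch illustrates the bidirectional interaction attention, assuming the context and the commonsense knowledge have already been encoded as vector sequences of the same hidden size. A single similarity matrix yields both attention directions; the concatenate-and-project fusion, and performing one pass rather than iterating the similarity computation, are simplifying assumptions.

```python
# A minimal sketch of bidirectional interaction attention between context
# and commonsense knowledge.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalInteractionAttention(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        self.proj_c = nn.Linear(2 * d_model, d_model)  # updates the context
        self.proj_k = nn.Linear(2 * d_model, d_model)  # updates the knowledge

    def forward(self, ctx, know):
        # ctx: (batch, n, d), know: (batch, m, d)
        # Similarity matrix between every context and knowledge position.
        sim = torch.bmm(ctx, know.transpose(1, 2))                     # (batch, n, m)
        # Context-to-knowledge attention: each context token reads the knowledge.
        c2k = torch.bmm(F.softmax(sim, dim=-1), know)                  # (batch, n, d)
        # Knowledge-to-context attention: each knowledge item reads the context.
        k2c = torch.bmm(F.softmax(sim.transpose(1, 2), dim=-1), ctx)   # (batch, m, d)
        # Bidirectional update: fuse each side with what it attended to.
        new_ctx = self.proj_c(torch.cat([ctx, c2k], dim=-1))
        new_know = self.proj_k(torch.cat([know, k2c], dim=-1))
        return new_ctx, new_know

# Usage: 20 context positions interacting with 6 knowledge items.
bia = BidirectionalInteractionAttention()
new_ctx, new_know = bia(torch.randn(2, 20, 512), torch.randn(2, 6, 512))
```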
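The third sketch illustrates the multi-task emotion-aware decoder. The mean-pooled emotion classifier, the soft-embedding fusion of the predicted emotion into the decoder memory, and the loss weight of 0.5 are illustrative assumptions; causal masking is omitted for brevity.

```python
# A minimal sketch of the emotion-aware decoder: an auxiliary emotion
# classifier is trained jointly with the utterance generator, and its
# output conditions the generator on the target emotion.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionAwareDecoder(nn.Module):
    def __init__(self, vocab_size=30000, d_model=512, n_heads=8,
                 n_layers=4, n_emotions=7):
        super().__init__()
        self.emo_classifier = nn.Linear(d_model, n_emotions)  # auxiliary task
        self.emo_emb = nn.Embedding(n_emotions, d_model)
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, interaction, target_ids, emotion_label=None):
        # interaction: (batch, n, d), from the bidirectional attention component
        emo_logits = self.emo_classifier(interaction.mean(dim=1))  # (batch, n_emotions)
        # A soft emotion embedding keeps the fusion differentiable.
        emo_vec = F.softmax(emo_logits, dim=-1) @ self.emo_emb.weight  # (batch, d)
        # Fuse the emotion signal into the memory the generator attends to.
        memory = interaction + emo_vec.unsqueeze(1)
        logits = self.out(self.decoder(self.tok_emb(target_ids), memory))
        loss = None
        if emotion_label is not None:
            # Multi-task objective: generation loss plus emotion recognition
            # loss, the latter acting as an additional supervisory signal.
            gen_loss = F.cross_entropy(
                logits[:, :-1].reshape(-1, logits.size(-1)),
                target_ids[:, 1:].reshape(-1))
            emo_loss = F.cross_entropy(emo_logits, emotion_label)
            loss = gen_loss + 0.5 * emo_loss  # weight 0.5 is an assumption
        return logits, emo_logits, loss
```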