Understanding how emotion is expressed and generating appropriate responses are key steps toward building emotional conversational agents. In this paper, we propose a framework for single-turn emotional conversation generation with three main components: a sequence-to-sequence model with stacked encoders, a conditional variational autoencoder, and conditional generative adversarial networks. For the sequence-to-sequence model, we design a two-layer stacked encoder that combines a Transformer with a gated recurrent unit-based neural network. Because the sequence-to-sequence model is flexible, we incorporate a conditional variational autoencoder into the framework; it uses latent variables to learn a distribution over potential responses and thereby generates diverse responses. Furthermore, we treat the conditional variational autoencoder-based sequence-to-sequence model as the generative model, and its training is assisted by both a content discriminator and an emotion classifier, which help the model strengthen the content information and emotion expression of its responses. We evaluate our model and the baselines with both automatic and human evaluation on the NTCIR short text conversation task (STC-3) Chinese emotional conversation generation (CECG) subtask dataset [44]. The experimental results demonstrate that the proposed framework can generate semantically reasonable and emotionally appropriate responses.
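To make the role of the latent variables concrete, the following is a minimal sketch of the two operations a conditional variational autoencoder typically relies on: drawing a latent sample with the reparameterization trick, and computing the KL divergence between a recognition distribution and a prior. It assumes diagonal Gaussian distributions, which is the common choice but is not stated in this abstract, and the function names `sample_latent` and `gaussian_kl` are hypothetical. It illustrates the general technique only and is not the implementation used in this work.

```python
import math
import random


def sample_latent(mu, logvar, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1).

    mu and logvar are lists of per-dimension means and log-variances
    (assumed here to come from a network conditioned on the input post).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq
                        + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp)
                        - 1.0)
    return total


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    mu, logvar = [0.0, 1.0], [0.0, -1.0]
    # Two draws from the same distribution give two different latent codes,
    # which is what lets a decoder produce different responses to one post.
    print(sample_latent(mu, logvar, rng))
    print(sample_latent(mu, logvar, rng))
    print(gaussian_kl(mu, logvar, [0.0, 0.0], [0.0, 0.0]))
```

In such a setup, the KL term would be added to the reconstruction loss during training, and at inference time the latent code would be sampled from the prior so that repeated decoding of the same post can yield different responses.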