The performance of zero-shot learning (ZSL) can be progressively improved by learning better visual features and by generating pseudo-samples for unseen classes. However, existing ZSL works typically learn the feature extractor and the generator independently, which may shift the generated unseen samples away from their real distribution and thus suffer from the domain shift problem. In this article, to tackle this challenge, we propose a variational autoencoder (VAE)-based framework, namely, joint Attentive Region Embedding with Enhanced Semantics (AREES), which is tailored to advance zero-shot recognition. Specifically, AREES is end-to-end trainable and consists of three network branches: 1) attentive region embedding learns semantic-guided visual features through an attention mechanism; 2) a decomposition structure with a semantic pivot regularization extracts enhanced semantics; and 3) a multimodal VAE (mVAE) with a cross-reconstruction loss and a distribution alignment loss learns a shared latent embedding space for visual features and semantics. Finally, feature extraction and feature generation are optimized jointly in AREES, which alleviates the domain shift problem to a large extent. Comprehensive evaluations on six benchmarks, including ImageNet, demonstrate the superiority of the proposed model over its state-of-the-art counterparts.
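
To make the mVAE branch concrete, the following is a minimal sketch (in PyTorch) of a two-branch multimodal VAE trained with within-modality reconstruction, KL, cross-reconstruction, and distribution alignment terms. The layer sizes, loss weights, and the use of a 2-Wasserstein distance between the two diagonal Gaussian posteriors as the alignment term are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of a two-branch multimodal VAE with cross-reconstruction and
# distribution-alignment losses; dimensions and weights are assumed, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BranchVAE(nn.Module):
    """One modality branch: encoder to (mu, logvar), decoder back to the input space."""
    def __init__(self, in_dim, latent_dim, hidden_dim=512):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, in_dim))

    def encode(self, x):
        h = self.enc(x)
        return self.mu(h), self.logvar(h)

    @staticmethod
    def reparameterize(mu, logvar):
        # Standard VAE reparameterization trick: z = mu + sigma * eps.
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

def mvae_losses(vae_v, vae_s, x_v, x_s, beta=1.0, gamma=1.0):
    """Within-modality reconstruction + KL, cross-reconstruction, and latent alignment."""
    mu_v, lv_v = vae_v.encode(x_v)
    mu_s, lv_s = vae_s.encode(x_s)
    z_v = vae_v.reparameterize(mu_v, lv_v)
    z_s = vae_s.reparameterize(mu_s, lv_s)

    # Each branch reconstructs its own modality.
    recon = F.mse_loss(vae_v.dec(z_v), x_v) + F.mse_loss(vae_s.dec(z_s), x_s)
    # KL divergence of each posterior from the standard normal prior.
    kl = (-0.5 * torch.mean(1 + lv_v - mu_v.pow(2) - lv_v.exp())
          - 0.5 * torch.mean(1 + lv_s - mu_s.pow(2) - lv_s.exp()))
    # Cross-reconstruction: decode each modality's latent with the other decoder.
    cross = F.mse_loss(vae_s.dec(z_v), x_s) + F.mse_loss(vae_v.dec(z_s), x_v)
    # Distribution alignment: 2-Wasserstein distance between the diagonal Gaussians.
    align = (torch.norm(mu_v - mu_s, dim=1) ** 2
             + torch.norm(torch.exp(0.5 * lv_v) - torch.exp(0.5 * lv_s), dim=1) ** 2).mean()
    return recon + kl + beta * cross + gamma * align

# Usage with assumed sizes: 2048-d visual features and 312-d attribute semantics.
vae_v, vae_s = BranchVAE(2048, 64), BranchVAE(312, 64)
loss = mvae_losses(vae_v, vae_s, torch.randn(8, 2048), torch.randn(8, 312))
loss.backward()
```

A shared latent space of this kind is what allows pseudo-samples for unseen classes to be drawn from the semantic branch and decoded, or classified, in the same space as real visual features.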