Deep neural networks are often data-starved in real-world applications, and manual annotation can be costly; audio emotion recognition is one such case. In contrast, continued research in image-based facial expression recognition (IFER) provides a rich supply of publicly available labeled IFER datasets. Exploiting the inherent correlations between the two modalities so that images can support audio emotion recognition under limited labeled data is therefore a meaningful yet challenging task. This paper proposes a system that transfers knowledge from the labeled visual domain to the heterogeneous audio domain with limited labels by learning a joint distribution over examples of the two modalities, so that the system can map an IFER example to a corresponding audio spectrogram. We then reformulate audio emotion classification as the (K+1)-class discriminator of a GAN-based semi-supervised learning framework. Good semi-supervised learning requires a generator that does not sample from a distribution closely matching the true data distribution; we therefore require the generated examples to come from low-density areas of the marginal distribution in the audio spectrogram modality. Concretely, the proposed model translates image samples class-wise into audio spectrograms. To exploit decoded samples that lie in sparsely populated regions and to construct a tighter decision boundary, we present a method that precisely estimates the density in feature space and incorporates low-density samples via an annealing scheme. Our method requires the network to discriminate low-density data points from high-density data points throughout classification, and we show that this technique effectively improves task performance.
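For concreteness, the (K+1)-class reformulation used in GAN-based semi-supervised learning treats the K real emotion classes as the first K outputs of the discriminator and "generated" as an extra (K+1)-th class. The sketch below is a generic, minimal illustration of that loss under our naming assumptions (function and variable names are hypothetical, not taken from our implementation); labeled real spectrograms contribute a supervised cross-entropy term, while unlabeled real spectrograms and generator outputs (e.g., image-to-spectrogram translations) contribute the unsupervised terms.

```python
# Minimal sketch of a (K+1)-class semi-supervised discriminator loss.
# Assumptions: logits have K+1 columns; names below are illustrative only.
import torch
import torch.nn.functional as F

K = 7  # hypothetical number of emotion classes

def k_plus_one_loss(logits_labeled, labels, logits_unlabeled, logits_generated):
    # Supervised term: cross-entropy over the K real emotion classes only.
    sup = F.cross_entropy(logits_labeled[:, :K], labels)

    # Unlabeled real data: probability of the "generated" class should be small.
    p_fake_unl = F.softmax(logits_unlabeled, dim=1)[:, K]
    unsup_real = -torch.log(1.0 - p_fake_unl + 1e-8).mean()

    # Generator outputs: probability of the "generated" class should be large.
    p_fake_gen = F.softmax(logits_generated, dim=1)[:, K]
    unsup_fake = -torch.log(p_fake_gen + 1e-8).mean()

    return sup + unsup_real + unsup_fake
```

In this setting, a "good" generator for the classifier is one whose samples fall in low-density regions of the real spectrogram distribution, so that pushing them into the (K+1)-th class tightens the decision boundaries around the high-density class regions.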