Although image recognition technology is advancing rapidly with deep learning, conventional recognition models trained by supervised learning with class labels do not work well when given test inputs from classes unseen during training. For example, a recognizer trained to classify Asian bird species cannot recognize a kiwi, because the class label “kiwi” and its image samples were never seen during training. To overcome this limitation, zero-shot classification has recently been studied, and the joint-embedding-based approach has been suggested as one of the promising solutions. In this approach, image features and text descriptions belonging to the same class are trained to lie close together in a common joint-embedding space. Once an embedding function that captures the semantic relationship of the image–text pairs in the training data is obtained, test images and the text descriptions (prototypes) of unseen classes can also be mapped into the joint-embedding space for classification. The main challenge with this approach is mapping inputs of two different modalities into a common space, and previous works suffer from a mismatch between the distributions of the two feature sets extracted from the heterogeneous inputs in the joint-embedding space. To address this problem, we propose a novel method that employs additional textual information to rectify the visual representation of input images. Since the conceptual information of the test classes is generally given as text, we expect that additional descriptions produced by a caption generator can adjust the visual features to better match the representations of the test classes. We also propose using the generated textual descriptions to augment the training samples for learning the joint-embedding space. In experiments on two benchmark datasets, the proposed method achieves significant performance improvements of 1.4% on the CUB dataset and 5.5% on the flower dataset compared to existing models.
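
The following is a minimal sketch (in PyTorch) of the joint-embedding idea described above, not the exact architecture or loss of the proposed model: an image encoder and a text encoder project their features into a shared space, a generated-caption embedding is fused with the visual embedding as a stand-in for the rectification step, and an unseen-class image is labelled by its nearest class prototype in that space. The module names (`JointEmbedding`, `rectify`), feature dimensions, and symmetric cross-entropy loss are illustrative assumptions.

```python
# Sketch only: hypothetical layer names and dimensions; the real model and
# training objective are described in the paper, not reproduced here.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointEmbedding(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, emb_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)   # projects image features
        self.txt_proj = nn.Linear(txt_dim, emb_dim)   # projects text features
        # fuses the visual embedding with the generated-caption embedding
        self.rectify = nn.Linear(2 * emb_dim, emb_dim)

    def embed_image(self, img_feat, caption_feat=None):
        v = F.normalize(self.img_proj(img_feat), dim=-1)
        if caption_feat is not None:                  # rectification step
            c = F.normalize(self.txt_proj(caption_feat), dim=-1)
            v = F.normalize(self.rectify(torch.cat([v, c], dim=-1)), dim=-1)
        return v

    def embed_text(self, txt_feat):
        return F.normalize(self.txt_proj(txt_feat), dim=-1)


def pairwise_matching_loss(img_emb, txt_emb, temperature=0.07):
    """Pull matching image-text pairs together, push mismatched pairs apart."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


# Toy usage with random tensors standing in for real encoder outputs.
model = JointEmbedding()
img_feat = torch.randn(8, 2048)        # e.g. pooled CNN features
cls_text_feat = torch.randn(8, 768)    # class-description embeddings
cap_feat = torch.randn(8, 768)         # generated-caption embeddings

loss = pairwise_matching_loss(model.embed_image(img_feat, cap_feat),
                              model.embed_text(cls_text_feat))
loss.backward()

# Zero-shot classification: nearest unseen-class prototype in the joint space.
unseen_protos = model.embed_text(torch.randn(5, 768))      # 5 unseen classes
test_emb = model.embed_image(torch.randn(1, 2048), torch.randn(1, 768))
pred_class = (test_emb @ unseen_protos.t()).argmax(dim=-1)
```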