The attention mechanism has been established as an effective method for generating caption words in image captioning: it attends to one salient subregion of an image to predict the related caption word. However, even though the attention mechanism can supply accurate subregions for training, the learned captioner may still make incorrect predictions, especially for visual concept words, which are the most important elements for understanding an image. To tackle this problem, in this article we propose the Visual Concept Enhanced Captioner, which employs a joint attention mechanism over visual concept samples to strengthen the prediction of visual concepts in image captioning. Unlike traditional attention approaches that adopt a single LSTM to explore one attended subregion at a time, the Visual Concept Enhanced Captioner introduces multiple virtual LSTMs in parallel to simultaneously receive multiple subregions from visual concept samples. The model then updates its parameters by jointly exploring these subregions under a composite loss function. Technically, this joint learning helps identify the common characteristics of a visual concept, and thus it enhances prediction accuracy for visual concepts. Moreover, by integrating diverse visual concept samples from different domains, our model can be extended to reduce visual bias in cross-domain learning for image captioning, which saves the cost of labeling captions. Extensive experiments have been conducted on two image datasets (MSCOCO and Flickr30K), and superior results are reported in comparison with state-of-the-art approaches. Notably, our approach significantly increases BLEU-1 and F1 scores, demonstrating improved accuracy for visual concepts in image captioning.
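To make the described architecture concrete, the following is a minimal PyTorch sketch, not the authors' implementation, of how multiple weight-shared "virtual" LSTM branches might each attend to the subregions of one visual concept sample and contribute to a composite cross-entropy loss. The class and function names (`JointConceptCaptioner`, `composite_loss`), the feature dimensions, and the assumption that the parallel branches share one set of LSTM parameters are all illustrative choices, not details taken from the paper.

```python
# Illustrative sketch (hypothetical, not the authors' code): joint attention over
# several visual-concept samples using parallel, weight-shared LSTM branches
# trained with a composite loss.
import torch
import torch.nn as nn


class JointConceptCaptioner(nn.Module):
    """Hypothetical captioner: each parallel branch attends over the subregions
    of one visual-concept sample and shares the same LSTM parameters."""

    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attend = nn.Linear(feat_dim + hidden_dim, 1)   # attention score per subregion
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def step(self, regions, word, state):
        """One decoding step for one sample: attend over its subregions, then
        feed the attended feature and the word embedding to the shared LSTM."""
        h, c = state
        scores = self.attend(
            torch.cat([regions, h.unsqueeze(1).expand(-1, regions.size(1), -1)], dim=-1)
        )
        alpha = torch.softmax(scores, dim=1)                # (batch, regions, 1)
        context = (alpha * regions).sum(dim=1)              # attended subregion feature
        h, c = self.lstm(torch.cat([self.embed(word), context], dim=-1), (h, c))
        return self.output(h), (h, c)


def composite_loss(model, concept_samples, words, targets, states):
    """Composite objective: average the cross-entropy of every parallel branch so
    the shared parameters are updated jointly from all visual-concept samples."""
    criterion = nn.CrossEntropyLoss()
    total = 0.0
    for regions, word, target, state in zip(concept_samples, words, targets, states):
        logits, _ = model.step(regions, word, state)
        total = total + criterion(logits, target)
    return total / len(concept_samples)


# Usage sketch: K = 3 visual-concept samples, each with R = 36 subregion features.
model = JointConceptCaptioner()
K, B, R = 3, 4, 36
samples = [torch.randn(B, R, 2048) for _ in range(K)]
words = [torch.randint(0, 10000, (B,)) for _ in range(K)]
targets = [torch.randint(0, 10000, (B,)) for _ in range(K)]
states = [(torch.zeros(B, 512), torch.zeros(B, 512)) for _ in range(K)]
loss = composite_loss(model, samples, words, targets, states)
loss.backward()
```

In this sketch the "multiple virtual LSTMs" are realized as repeated applications of one shared `LSTMCell`, so the joint gradient from the composite loss is what couples the visual-concept samples; whether the paper shares parameters in exactly this way is an assumption.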