Recently, image captioning has achieved great progress in computer vision and artificial intelligence. However, language models still failed to achieve the desired results in high-level visual tasks. Generating accurate image captions for a complex scene that contains multiple targets is a challenge. To solve these problems, we introduce the theory of attention in psychology to image caption generation. We propose two types of attention mechanisms: The stimulus-driven and the concept-driven. Our attention model relies on a combination of convolutional neural network (CNN) over images and long-short term memory (LSTM) network over sentences. Comparison of experimental results illustrates that our proposed method achieves good performance on the MSCOCO test server.