Deep convolutional networks and recurrent neural networks have become widely used for image captioning, yet the architecture and performance of such models remain open research questions. We construct a model that combines two components to improve captioning accuracy. The pretrained VGG (Visual Geometry Group) convolutional network extracts image features, while a bidirectional LSTM (Long Short-Term Memory) network, which reads a sentence in both directions and thus better captures its meaning, learns caption features. The image and caption features are then combined to predict captions for images. The model is trained and tested on the Flickr8K dataset, and it can also produce captions shorter than a specified caption length. We evaluate the model with the Bilingual Evaluation Understudy (BLEU) score, which measures the similarity between predicted and reference captions. Evaluation and comparison show that our model performs well under certain conditions.
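To make the evaluation metric concrete, the following is a minimal illustrative sketch of BLEU-1 (modified unigram precision with a brevity penalty) for a single reference caption. It is a simplification for exposition only; the example sentences are hypothetical, and real evaluations typically use multi-reference, higher-order n-gram BLEU such as NLTK's `corpus_bleu`.

```python
from collections import Counter
import math

def bleu1(reference, candidate):
    """Simplified BLEU-1: clipped unigram precision times a brevity penalty.
    Assumes a single reference; real BLEU uses multiple references and n-grams."""
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    # Clip each candidate unigram count by its count in the reference,
    # so repeating a correct word does not inflate the score.
    clipped = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    precision = clipped / max(len(candidate), 1)
    # Brevity penalty discourages candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * precision

# Hypothetical example captions (tokenized by whitespace).
reference = "a dog runs across the grass".split()
candidate = "a dog runs on the grass".split()
score = bleu1(reference, candidate)  # 5 of 6 unigrams match, no length penalty
```

Here the candidate matches five of six reference unigrams and has the same length as the reference, so the score is 5/6.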