Deep learning has achieved many notable results in image recognition. However, a number of problems remain to be solved before it can be used in practical applications such as image retrieval, image annotation, and image-text conversion. This paper studies the structure of deep learning models, improves commonly used training algorithms, and proposes two new neural network models for different application scenarios. A Support Vector Machine (SVM) is used as the main classifier for Internet of Things image recognition, and both the SVM and a convolutional neural network (CNN) are trained on this paper's database. The effectiveness of the two methods for image recognition is then tested, and the trained classifiers are applied to image recognition. The results show that on the labeled data set the rank-1 accuracy of CNN is 85.77%, compared with 90.28% for the SVM method. On the detection data, CNN’s rank-1 accuracy is 83.11%, which exceeds SVM’s 80.22%. SVM+CNN achieves a rank-1 accuracy of 84.69% on the detection data set. This shows that deep learning can map the feature representations of images and words into the same space, making the calculation of similarity and correlation between image and text easier and more straightforward.